Pub Date : 2019-06-20 DOI: 10.5705/SS.202020.0149
R. Mukerjee, Tirthankar Dasgupta
"Causal Inference from Possibly Unbalanced Split-Plot Designs: A Randomization-based Perspective"
Split-plot designs find wide applicability in multifactor experiments with randomization restrictions. Practical considerations often warrant the use of unbalanced designs. This paper investigates randomization-based causal inference in split-plot designs that are possibly unbalanced. Extending ideas from the recently studied balanced case yields an expression for the sampling variance of a treatment contrast estimator, as well as a conservative estimator of that sampling variance. However, the bias of this variance estimator does not vanish even when the treatment effects are strictly additive. A careful and involved matrix analysis overcomes this difficulty and yields a new variance estimator that is unbiased under milder conditions. A construction procedure that generates such an estimator with minimax bias is proposed.
{"title":"Causal Inference from Possibly Unbalanced Split-Plot Designs: A Randomization-based Perspective.","authors":"R. Mukerjee, Tirthankar Dasgupta","doi":"10.5705/SS.202020.0149","DOIUrl":"https://doi.org/10.5705/SS.202020.0149","url":null,"abstract":"Split-plot designs find wide applicability in multifactor experiments with randomization restrictions. Practical considerations often warrant the use of unbalanced designs. This paper investigates randomization based causal inference in split-plot designs that are possibly unbalanced. Extension of ideas from the recently studied balanced case yields an expression for the sampling variance of a treatment contrast estimator as well as a conservative estimator of the sampling variance. However, the bias of this variance estimator does not vanish even when the treatment effects are strictly additive. A careful and involved matrix analysis is employed to overcome this difficulty, resulting in a new variance estimator, which becomes unbiased under milder conditions. A construction procedure that generates such an estimator with minimax bias is proposed.","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114394496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-06-11 DOI: 10.5705/ss.202019.0239
S. Sugasawa, Jae Kwang Kim
"An Approximate Bayesian Approach to Model-assisted Survey Estimation with Many Auxiliary Variables"
Model-assisted estimation with complex survey data is an important practical problem in survey sampling. When there are many auxiliary variables, selecting the variables that are actually associated with the study variable is necessary for efficient estimation of the population parameters of interest. In this paper, we formulate a regularized regression estimator in the framework of Bayesian inference, using the penalty function as a shrinkage prior for model selection. The proposed Bayesian approach yields not only efficient point estimates but also reasonable credible intervals. Results from two limited simulation studies are presented to facilitate comparison with existing frequentist methods.
{"title":"An Approximate Bayesian Approach to Model-assisted Survey Estimation with Many Auxiliary Variables.","authors":"S. Sugasawa, Jae Kwang Kim","doi":"10.5705/ss.202019.0239","DOIUrl":"https://doi.org/10.5705/ss.202019.0239","url":null,"abstract":"Model-assisted estimation with complex survey data is an important practical problem in survey sampling. When there are many auxiliary variables, selecting significant variables associated with the study variable would be necessary to achieve efficient estimation of population parameters of interest. In this paper, we formulate a regularized regression estimator in the framework of Bayesian inference using the penalty function as the shrinkage prior for model selection. The proposed Bayesian approach enables us to get not only efficient point estimates but also reasonable credible intervals. Results from two limited simulation studies are presented to facilitate comparison with existing frequentist methods.","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123930316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-05-17 DOI: 10.1920/WP.CEM.2019.2319
S. Lee, J. Horowitz
"Non-asymptotic inference in a class of optimization problems"
This paper describes a method for carrying out non-asymptotic inference on partially identified parameters that are solutions to a class of optimization problems. The optimization problems arise in applications in which grouped data are used for estimation of a model's structural parameters. The parameters are characterized by restrictions that involve the population means of observed random variables in addition to the structural parameters of interest. Inference consists of finding confidence intervals for the structural parameters. Our method is non-asymptotic in the sense that it provides a finite-sample bound on the difference between the true and nominal probabilities with which a confidence interval contains the true but unknown value of a parameter. We contrast our method with an alternative non-asymptotic method based on the median-of-means estimator of Minsker (2015). The results of Monte Carlo experiments and an empirical example illustrate the usefulness of our method.
{"title":"Non-asymptotic inference in a class of optimization problems","authors":"S. Lee, J. Horowitz","doi":"10.1920/WP.CEM.2019.2319","DOIUrl":"https://doi.org/10.1920/WP.CEM.2019.2319","url":null,"abstract":"This paper describes a method for carrying out non-asymptotic inference on partially identified parameters that are solutions to a class of optimization problems. The optimization problems arise in applications in which grouped data are used for estimation of a model's structural parameters. The parameters are characterized by restrictions that involve the population means of observed random variables in addition to the structural parameters of interest. Inference consists of finding confidence intervals for the structural parameters. Our method is non-asymptotic in the sense that it provides a finite-sample bound on the difference between the true and nominal probabilities with which a confidence interval contains the true but unknown value of a parameter. We contrast our method with an alternative non-asymptotic method based on the median-of-means estimator of Minsker (2015). The results of Monte Carlo experiments and an empirical example illustrate the usefulness of our method.","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122750335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-05-03 DOI: 10.31222/osf.io/3bnxs
Ilyas Bakbergenuly, D. Hoaglin, E. Kulinskaya
"Simulation study of estimating between-study variance and overall effect in meta-analyses of log-response-ratio for normal data"
Methods for random-effects meta-analysis require an estimate of the between-study variance, $\tau^2$. The performance of estimators of $\tau^2$ (measured by bias and coverage) affects their usefulness in assessing heterogeneity of study-level effects, and also the performance of related estimators of the overall effect. For the effect measure log-response-ratio (LRR, also known as the logarithm of the ratio of means, RoM), we review four point estimators of $\tau^2$ (the popular methods of DerSimonian-Laird (DL), restricted maximum likelihood, and Mandel and Paule (MP), and the less-familiar method of Jackson), four interval estimators for $\tau^2$ (profile likelihood, Q-profile, Biggerstaff and Jackson, and Jackson), five point estimators of the overall effect (the four related to the point estimators of $\tau^2$ and an estimator whose weights use only study-level sample sizes), and seven interval estimators for the overall effect (four based on the point estimators for $\tau^2$, the Hartung-Knapp-Sidik-Jonkman (HKSJ) interval, a modification of HKSJ that uses the MP estimator of $\tau^2$ instead of the DL estimator, and an interval based on the sample-size-weighted estimator). We obtain empirical evidence from extensive simulations of data from normal distributions. Simulations from lognormal distributions are reported separately (Bakbergenuly et al., 2019b).
{"title":"Simulation study of estimating between-study variance and overall effect in meta-analyses of log-response-ratio for normal data","authors":"Ilyas Bakbergenuly, D. Hoaglin, E. Kulinskaya","doi":"10.31222/osf.io/3bnxs","DOIUrl":"https://doi.org/10.31222/osf.io/3bnxs","url":null,"abstract":"Methods for random-effects meta-analysis require an estimate of the between-study variance, $tau^2$. The performance of estimators of $tau^2$ (measured by bias and coverage) affects their usefulness in assessing heterogeneity of study-level effects, and also the performance of related estimators of the overall effect. For the effect measure log-response-ratio (LRR, also known as the logarithm of the ratio of means, RoM), we review four point estimators of $tau^2$ (the popular methods of DerSimonian-Laird (DL), restricted maximum likelihood, and Mandel and Paule (MP), and the less-familiar method of Jackson), four interval estimators for $tau^2$ (profile likelihood, Q-profile, Biggerstaff and Jackson, and Jackson), five point estimators of the overall effect (the four related to the point estimators of $tau^2$ and an estimator whose weights use only study-level sample sizes), and seven interval estimators for the overall effect (four based on the point estimators for $tau^2$, the Hartung-Knapp-Sidik-Jonkman (HKSJ) interval, a modification of HKSJ that uses the MP estimator of $tau^2$ instead of the DL estimator, and an interval based on the sample-size-weighted estimator). We obtain empirical evidence from extensive simulations of data from normal distributions. Simulations from lognormal distributions are in a separate report Bakbergenuly et al. 2019b.","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123229664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-04-23 DOI: 10.1214/21-EJS1820
Yan Liu, Minggen Lu, C. McMahan
"A penalized likelihood approach for efficiently estimating a partially linear additive transformation model with current status data"
Current status data are commonly encountered in medical and epidemiological studies in which the failure time of study units is the outcome variable of interest. Data of this form are characterized by the fact that the failure time is not observed directly but is known only relative to an observation time; i.e., the failure times are either left- or right-censored. Owing to this structure, the analysis of such data can be challenging. To circumvent these challenges and to provide a flexible modeling construct for analyzing current status data, a partially linear additive transformation model is proposed herein. In this model, constrained $B$-splines are employed to model the monotone transformation function and the nonlinear covariate effects. To obtain more efficient estimates, a penalization technique is used to regularize the estimation of all unknown functions. An easy-to-implement hybrid algorithm is developed for model fitting, and a simple estimator of the large-sample variance-covariance matrix is proposed. It is shown theoretically that the proposed estimators of the finite-dimensional regression coefficients are root-$n$ consistent, asymptotically normal, and achieve the semiparametric information bound, while the estimators of the nonparametric components attain the optimal rate of convergence. The finite-sample performance of the proposed methodology is evaluated through extensive numerical studies and is further demonstrated through an analysis of uterine leiomyomata data.
{"title":"A penalized likelihood approach for efficiently estimating a partially linear additive transformation model with current status data","authors":"Yan Liu, Minggen Lu, C. McMahan","doi":"10.1214/21-EJS1820","DOIUrl":"https://doi.org/10.1214/21-EJS1820","url":null,"abstract":"Current status data are commonly encountered in medical and epidemiological studies in which the failure time for study units is the outcome variable of interest. Data of this form are characterized by the fact that the failure time is not directly observed but rather is known relative to an observation time; i.e., the failure times are either left- or right-censored. Due to its structure, the analysis of such data can be challenging. To circumvent these challenges and to provide for a flexible modeling construct which can be used to analyze current status data, herein, a partially linear additive transformation model is proposed. In the formulation of this model, constrained $B$-splines are employed to model the monotone transformation function and nonlinear covariate effects. To provide for more efficient estimates, a penalization technique is used to regularize the estimation of all unknown functions. An easy to implement hybrid algorithm is developed for model fitting and a simple estimator of the large-sample variance-covariance matrix is proposed. It is shown theoretically that the proposed estimators of the finite-dimensional regression coefficients are root-$n$ consistent, asymptotically normal, and achieve the semi-parametric information bound while the estimators of the nonparametric components attain the optimal rate of convergence. The finite-sample performance of the proposed methodology is evaluated through extensive numerical studies and is further demonstrated through the analysis of uterine leiomyomata data.","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122996417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-04-10 DOI: 10.1214/21-SS131
M. Fay, S. Hunsberger
"Practical valid inferences for the two-sample binomial problem"
Our interest is in whether two binomial parameters differ, which parameter is larger, and by how much. This apparently simple problem was addressed by Fisher in the 1930s and has been the subject of many review papers since then, yet there continues to be new work on the issue and no consensus solution. Previous reviews have focused primarily on testing and the properties of validity and power, or primarily on confidence intervals, their coverage, and expected length. Here we evaluate both. For example, we consider whether a p-value and its matching confidence interval are compatible, meaning that the p-value rejects at level $\alpha$ if and only if the $1-\alpha$ confidence interval excludes all null parameter values. For focus, we examine only non-asymptotic inferences, so that most of the p-values and confidence intervals are valid (i.e., exact) by construction. Within this focus, we review different methods, emphasizing many of the properties and interpretational aspects we desire from applied frequentist inference: validity, accuracy, good power, equivariance, compatibility, coherence, and parameterization and direction of effect. We show that no one method can meet all the desirable properties and give recommendations based on which properties are given more importance.
{"title":"Practical valid inferences for the two-sample binomial problem","authors":"M. Fay, S. Hunsberger","doi":"10.1214/21-SS131","DOIUrl":"https://doi.org/10.1214/21-SS131","url":null,"abstract":"Our interest is whether two binomial parameters differ, which parameter is larger, and by how much. This apparently simple problem was addressed by Fisher in the 1930's, and has been the subject of many review papers since then. Yet there continues to be new work on this issue and no consensus solution. Previous reviews have focused primarily on testing and the properties of validity and power, or primarily on confidence intervals, their coverage, and expected length. Here we evaluate both. For example, we consider whether a p-value and its matching confidence interval are compatible, meaning that the p-value rejects at level $alpha$ if and only if the $1-alpha$ confidence interval excludes all null parameter values. For focus, we only examine non-asymptotic inferences, so that most of the p-values and confidence intervals are valid (i.e., exact) by construction. Within this focus, we review different methods emphasizing many of the properties and interpretational aspects we desire from applied frequentist inference: validity, accuracy, good power, equivariance, compatibility, coherence, and parameterization and direction of effect. We show that no one method can meet all the desirable properties and give recommendations based on which properties are given more importance.","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127623961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-03-07 DOI: 10.1214/19-AOS1920
Dongming Huang, Lucas Janson
"Relaxing the assumptions of knockoffs by conditioning"
The recent paper of Candès et al. (2018) introduced model-X knockoffs, a method for variable selection that provably and non-asymptotically controls the false discovery rate with no restrictions or assumptions on the dimensionality of the data or the conditional distribution of the response given the covariates. The one requirement of the procedure is that the covariate samples are drawn independently and identically from a precisely known (but arbitrary) distribution. The present paper shows that the exact same guarantees can be made without knowing the covariate distribution fully, but instead knowing it only up to a parametric model with as many as $\Omega(n^{*}p)$ parameters, where $p$ is the dimension and $n^{*}$ is the number of covariate samples (which may exceed the usual sample size $n$ of labeled samples when unlabeled samples are also available). The key is to treat the covariates as if they are drawn conditionally on the observed value of a sufficient statistic of the model. Although this idea is simple, even in Gaussian models conditioning on a sufficient statistic leads to a distribution supported on a set of zero Lebesgue measure, requiring techniques from topological measure theory to establish valid algorithms. We demonstrate how to do this for three models of interest, with simulations showing that the new approach remains powerful under the weaker assumptions.
{"title":"Relaxing the assumptions of knockoffs by conditioning","authors":"Dongming Huang, Lucas Janson","doi":"10.1214/19-AOS1920","DOIUrl":"https://doi.org/10.1214/19-AOS1920","url":null,"abstract":"The recent paper Cand`es et al. (2018) introduced model-X knockoffs, a method for variable selection that provably and non-asymptotically controls the false discovery rate with no restrictions or assumptions on the dimensionality of the data or the conditional distribution of the response given the covariates. The one requirement for the procedure is that the covariate samples are drawn independently and identically from a precisely-known (but arbitrary) distribution. The present paper shows that the exact same guarantees can be made without knowing the covariate distribution fully, but instead knowing it only up to a parametric model with as many as $Omega(n^{*}p)$ parameters, where $p$ is the dimension and $n^{*}$ is the number of covariate samples (which may exceed the usual sample size $n$ of labeled samples when unlabeled samples are also available). The key is to treat the covariates as if they are drawn conditionally on their observed value for a sufficient statistic of the model. Although this idea is simple, even in Gaussian models conditioning on a sufficient statistic leads to a distribution supported on a set of zero Lebesgue measure, requiring techniques from topological measure theory to establish valid algorithms. We demonstrate how to do this for three models of interest, with simulations showing the new approach remains powerful under the weaker assumptions.","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115750544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-02-19 DOI: 10.3254/190007
S. Miccichè, R. Mantegna
"A primer on statistically validated networks"
In this contribution we discuss network-analysis approaches that provide information about single links or single nodes with respect to a null hypothesis that takes into account the empirically observed heterogeneity of the system. With this approach, a selection of nodes and links is possible whenever the null hypothesis is statistically rejected. We focus our discussion on approaches using (i) the so-called disparity filter and (ii) statistically validated networks in bipartite systems. For both methods we discuss the importance of multiple-hypothesis-test corrections. Specific applications of statistically validated networks are discussed. We also discuss how statistically validated networks can be used to (i) pre-process large sets of data and (ii) detect the cores of communities that form the most close-knit and stable subsets of the clusters of nodes present in a complex system.
{"title":"A primer on statistically validated networks","authors":"S. Miccichè, R. Mantegna","doi":"10.3254/190007","DOIUrl":"https://doi.org/10.3254/190007","url":null,"abstract":"In this contribution we discuss some approaches of network analysis providing information about single links or single nodes with respect to a null hypothesis taking into account the heterogeneity of the system empirically observed. With this approach, a selection of nodes and links is feasible when the null hypothesis is statistically rejected. We focus our discussion on approaches using (i) the so-called disparity filter and (ii) statistically validated network in bipartite networks. For both methods we discuss the importance of using multiple hypothesis test correction. Specific applications of statistically validated networks are discussed. We also discuss how statistically validated networks can be used to (i) pre-process large sets of data and (ii) detect cores of communities that are forming the most close-knit and stable subsets of clusters of nodes present in a complex system.","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123866559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-01-14 DOI: 10.1515/9783110635461-005
C. Oates, J. Cockayne, D. Prangle, T. Sullivan, M. Girolami
"Optimality criteria for probabilistic numerical methods"
It is well understood that Bayesian decision theory and average case analysis are essentially identical. However, if one is interested in performing uncertainty quantification for a numerical task, it can be argued that the decision-theoretic framework is neither appropriate nor sufficient. To this end, we consider an alternative optimality criterion from Bayesian experimental design and study its implied optimal information in the numerical context. This information is demonstrated to differ, in general, from the information that would be used in an average-case-optimal numerical method. The explicit connection to Bayesian experimental design suggests several distinct regimes in which optimal probabilistic numerical methods can be developed.
{"title":"5. Optimality criteria for probabilistic numerical methods","authors":"C. Oates, J. Cockayne, D. Prangle, T. Sullivan, M. Girolami","doi":"10.1515/9783110635461-005","DOIUrl":"https://doi.org/10.1515/9783110635461-005","url":null,"abstract":"It is well understood that Bayesian decision theory and average case analysis are essentially identical. However, if one is interested in performing uncertainty quantification for a numerical task, it can be argued that the decision-theoretic framework is neither appropriate nor sufficient. To this end, we consider an alternative optimality criterion from Bayesian experimental design and study its implied optimal information in the numerical context. This information is demonstrated to differ, in general, from the information that would be used in an average-case-optimal numerical method. The explicit connection to Bayesian experimental design suggests several distinct regimes in which optimal probabilistic numerical methods can be developed.","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131725203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}