Revisiting Empirical Bayes Methods and Applications to Special Types of Data
Pub Date : 2021-06-29 DOI: 10.20381/RUOR-26562
XiuWen Duan
Empirical Bayes methods have been around for a long time and have a wide range of applications. They provide a way to aggregate historical data into estimates of the posterior mean. This thesis revisits several empirical Bayes methods and develops new applications. We first look at a linear empirical Bayes estimator and apply it to ranking and symbolic data. Next, we consider Tweedie's formula and show how it can be applied to analyze a microarray dataset. The application of the formula is simplified with the Pearson system of distributions. Saddlepoint approximations enable us to generalize several results in this direction. The results show that the proposed methods perform well in applications to real data sets.
Flexible Bayesian modelling of concomitant covariate effects in mixture models
Pub Date : 2021-05-26 DOI: 10.48676/UNIBO/AMSDOTTORATO/9861
Marco Berrettini, G. Galimberti, Saverio Ranciati, T. B. Murphy
Mixture models provide a useful tool to account for unobserved heterogeneity and are the basis of many model-based clustering methods. To gain additional flexibility, some model parameters can be expressed as functions of concomitant covariates. In particular, the prior probabilities of latent group membership can be linked to concomitant covariates through a multinomial logistic regression model, where each of these so-called component weights is associated with a linear predictor involving one or more of these variables. In this thesis, that approach is extended by replacing the linear predictors with additive ones, in which the contributions of some or all concomitant covariates are represented by smooth functions. An estimation procedure within the Bayesian paradigm is proposed. In particular, a data augmentation scheme based on difference random utility models is exploited, and the smoothness of the covariate effects is controlled by suitable choices for the prior distributions of the spline coefficients. The methodology is then extended to allow flexible covariate effects in the component densities as well. The performance of the proposed methodologies is investigated via simulation experiments and applications to real data.
The thesis is organized as follows. Chapter 1 reviews mixture models and mixture models with covariate effects. After a brief introduction to Bayesian additive models with P-splines, Chapter 2 presents the general specification of the proposed method, together with the associated Bayesian inference procedure. Chapters 3 and 4 adapt this approach to the specific cases of categorical and continuous manifest variables, respectively. Chapter 5 extends the methodology to include flexible covariate effects in the component densities. Finally, Chapter 6 collects conclusions and remarks.
A Critique of Differential Abundance Analysis, and Advocacy for an Alternative
Pub Date : 2021-04-14 DOI: 10.5281/ZENODO.4692004
Thomas P. Quinn, E. Gordon-Rodríguez, Ionas Erb
It is largely taken for granted that differential abundance analysis (DAA) is, by default, the best first step when analyzing genomic data. We argue that this is not necessarily the case. In this article, we identify key limitations that are intrinsic to differential abundance analysis: it is (a) dependent on unverifiable assumptions, (b) an unreliable construct, and (c) overly reductionist. We formulate an alternative framework, called ratio-based biomarker analysis, which does not suffer from the identified limitations. Moreover, ratio-based biomarkers are highly flexible. Beyond replacing DAA, they can also be used for many other bespoke analyses, including dimension reduction and multi-omics data integration.
Post-Processing of MCMC
Pub Date : 2021-03-30 DOI: 10.1146/ANNUREVSTATISTICS-040220-091727
Leah F. South, M. Riabiz, Onur Teymur, C. Oates
Markov chain Monte Carlo is the engine of modern Bayesian statistics, being used to approximate the posterior and derived quantities of interest. Despite this, the issue of how the output from a Markov chain is post-processed and reported is often overlooked. Convergence diagnostics can be used to control bias via burn-in removal, but these do not account for (common) situations where a limited computational budget engenders a bias-variance trade-off. The aim of this article is to review state-of-the-art techniques for post-processing Markov chain output. Our review covers methods based on discrepancy minimisation, which directly address the bias-variance trade-off, as well as general-purpose control variate methods for approximating expected quantities of interest.
{"title":"Post-Processing of MCMC","authors":"Leah F. South, M. Riabiz, Onur Teymur, C. Oates","doi":"10.1146/ANNUREVSTATISTICS-040220-091727","DOIUrl":"https://doi.org/10.1146/ANNUREVSTATISTICS-040220-091727","url":null,"abstract":"Markov chain Monte Carlo is the engine of modern Bayesian statistics, being used to approximate the posterior and derived quantities of interest. Despite this, the issue of how the output from a Markov chain is post-processed and reported is often overlooked. Convergence diagnostics can be used to control bias via burn-in removal, but these do not account for (common) situations where a limited computational budget engenders a bias-variance trade-off. The aim of this article is to review state-of-the-art techniques for post-processing Markov chain output. Our review covers methods based on discrepancy minimisation, which \u0000directly address the bias-variance trade-off, as well as general-purpose control variate methods for approximating expected quantities of interest.","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"318 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124295458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conditional variance estimator for sufficient dimension reduction
Pub Date : 2021-02-17 DOI: 10.3150/21-bej1402
L. Fertl, E. Bura
Conditional Variance Estimation (CVE) is a novel sufficient dimension reduction (SDR) method for additive error regressions with continuous predictors and link function. It operates under the assumption that the predictors can be replaced by a lower dimensional projection without loss of information. In contrast to the majority of moment-based sufficient dimension reduction methods, CVE is fully data driven, does not require the restrictive linearity and constant variance conditions, and is not based on inverse regression. CVE is shown to be consistent and its objective function to be uniformly convergent. CVE outperforms mean average variance estimation (MAVE), its main competitor, in several simulation settings, remains on par in others, and consistently outperforms the usual inverse-regression-based linear SDR methods, such as Sliced Inverse Regression.
The Violating Assumptions Series: Simulated demonstrations to illustrate how assumptions can affect statistical estimates
Pub Date : 2021-01-18 DOI: 10.13140/RG.2.2.13339.69921
Ian A. Silver
When teaching and discussing statistical assumptions, our focus is oftentimes placed on how to test and address potential violations rather than on the effects those violations have on the estimates produced by our statistical models. The latter represents a potential avenue to help us better understand the impact of researcher degrees of freedom on the statistical estimates we produce. The Violating Assumptions Series is an endeavor I have undertaken to demonstrate the effects of violating assumptions on the estimates produced across various statistical models. The series will review assumptions associated with estimating causal associations, as well as more complicated statistical models including, but not limited to, multilevel models, path models, structural equation models, and Bayesian models. In addition to this primary goal, the series of posts is designed to illustrate how simulations can be used to develop a comprehensive understanding of applied statistics.
{"title":"The Violating Assumptions Series: Simulated demonstrations to illustrate how assumptions can affect statistical estimates","authors":"Ian A. Silver","doi":"10.13140/RG.2.2.13339.69921","DOIUrl":"https://doi.org/10.13140/RG.2.2.13339.69921","url":null,"abstract":"When teaching and discussing statistical assumptions, our focus is oftentimes placed on how to test and address potential violations rather than the effects of violating assumptions on the estimates produced by our statistical models. The latter represents a potential avenue to help us better understand the impact of researcher degrees of freedom on the statistical estimates we produce. The Violating Assumptions Series is an endeavor I have undertaken to demonstrate the effects of violating assumptions on the estimates produced across various statistical models. The series will review assumptions associated with estimating causal associations, as well as more complicated statistical models including, but not limited to, multilevel models, path models, structural equation models, and Bayesian models. In addition to the primary goal, the series of posts is designed to illustrate how simulations can be used to develop a comprehensive understanding of applied statistics.","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126820360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Feature Weighted Mixed Naive Bayes Model for Monitoring Anomalies in the Fan System of a Thermal Power Plant
Pub Date : 2020-12-14 DOI: 10.1109/JAS.2020.000000
Min Wang, Li Sheng, Donghua Zhou, Maoyin Chen
With increasing intelligence and integration, a great number of two-valued variables (generally stored as 0/1 values) exist in large-scale industrial processes. However, these variables cannot be effectively handled by traditional monitoring methods such as linear discriminant analysis (LDA), principal component analysis (PCA), and partial least squares (PLS). Recently, a mixed hidden naive Bayesian model (MHNBM) was developed, for the first time, to utilize both two-valued and continuous variables for abnormality monitoring. Although MHNBM is effective, it still has shortcomings. In MHNBM, variables with greater correlation to other variables receive greater weights, which does not guarantee that greater weights are assigned to the more discriminating variables. In addition, the conditional probability must be computed from historical data; when training data are scarce, the conditional probability between continuous variables tends toward a uniform distribution, which degrades the performance of MHNBM. Here, a novel feature weighted mixed naive Bayes model (FWMNBM) is developed to overcome these shortcomings. In FWMNBM, variables that are more correlated with the class receive greater weights, so the more discriminating variables contribute more to the model. At the same time, FWMNBM does not have to calculate the conditional probability between variables, so it is less restricted by the number of training samples. Compared with MHNBM, FWMNBM has better performance; its effectiveness is validated on a numerical simulation example and a practical case from the Zhoushan thermal power plant (ZTPP), China.
A Generalized Heckman Model With Varying Sample Selection Bias and Dispersion Parameters
Pub Date : 2020-12-03 DOI: 10.5705/SS.202021.0068
F. D. S. Bastos, W. Barreto‐Souza, M. Genton
Many proposals have emerged as alternatives to the Heckman selection model, mainly to address the non-robustness of its normality assumption. The 2001 Medical Expenditure Panel Survey data are often used to illustrate this non-robustness. In this paper, we propose a generalization of the Heckman sample selection model that allows the sample selection bias and dispersion parameters to depend on covariates. We show that the non-robustness of the Heckman model may be due to the assumption of a constant sample selection bias parameter rather than to the normality assumption. Our methodology allows us to identify which covariates are important to explain the sample selection bias phenomenon, rather than only concluding whether it is present. We explore the inferential aspects of the maximum likelihood estimators (MLEs) for the proposed generalized Heckman model. More specifically, we show that the model satisfies regularity conditions ensuring consistency and asymptotic normality of the MLEs. Proper score residuals for sample selection models are provided, and model adequacy is addressed. Simulation results are presented to check the finite-sample behavior of the estimators and to verify the consequences of not allowing for varying sample selection bias and dispersion parameters. We show that the normality assumption is suitable for analyzing the medical expenditure data and that the conclusions drawn using our approach are coherent with findings from prior literature. Moreover, we identify which covariates are relevant to explain the presence of sample selection bias in this important dataset.
Double/debiased machine learning for logistic partially linear model
Pub Date : 2020-09-30 DOI: 10.1093/ECTJ/UTAB019
Molei Liu, Yi Zhang, D. Zhou
We propose double/debiased machine learning approaches to infer, at the parametric rate, the parametric component of a logistic partially linear model: the binary response follows a conditional logistic model whose predictor combines a low-dimensional linear parametric function of some key (exposure) covariates with a nonparametric function adjusting for the confounding effect of the other covariates. We consider a Neyman orthogonal (doubly robust) score equation involving two nuisance functions: the nonparametric component of the logistic model, and the conditional mean of the exposure given the other covariates with the response held fixed. To estimate the nuisance models, we separately consider the use of high dimensional (HD) sparse parametric models and more general (typically nonparametric) machine learning (ML) methods. In the HD case, we derive certain moment equations to calibrate the first-order bias of the nuisance models, granting our method a model double robustness property: our estimator achieves the desired rate when at least one of the nuisance models is correctly specified and both of them are ultra-sparse. In the ML case, the non-linearity of the logit link makes it substantially harder than in the partially linear setting to use an arbitrary conditional mean learning algorithm to estimate the nuisance component of the logistic model. We handle this obstacle through a novel full model refitting procedure that is easy to implement and facilitates the use of nonparametric ML algorithms in our framework. Our ML estimator is rate doubly robust in the same sense as Chernozhukov et al. (2018a). We evaluate our methods through simulation studies and apply them in assessing the effect of the emergency contraceptive (EC) pill on early gestation foetal death, exploiting a policy reform in Chile in 2008 (Bentancor and Clarke, 2017).
On Mendelian Randomization Mixed-Scale Treatment Effect Robust Identification (MR MiSTERI) and Estimation for Causal Inference
Pub Date : 2020-09-30 DOI: 10.1101/2020.09.29.20204420
Z. Liu, T. Ye, B. Sun, M. Schooling, E. T. Tchetgen Tchetgen
Standard Mendelian randomization analysis can produce biased results if the genetic variant defining the instrumental variable (IV) is confounded and/or has a horizontal pleiotropic effect on the outcome of interest not mediated by the treatment. We provide novel identification conditions for the causal effect of a treatment in the presence of unmeasured confounding, by leveraging an invalid IV for which both the IV independence and exclusion restriction assumptions may be violated. The proposed Mendelian randomization Mixed-Scale Treatment Effect Robust Identification (MR MiSTERI) approach relies on (i) an assumption that the treatment effect does not vary with the invalid IV on the additive scale; (ii) an assumption that the selection bias due to confounding does not vary with the invalid IV on the odds ratio scale; and (iii) an assumption that the residual variance for the outcome is heteroscedastic and thus varies with the invalid IV. Although assumptions (i) and (ii) have each appeared in the IV literature, assumption (iii) has not; we formally establish that their conjunction can identify a causal effect even with an invalid IV subject to pleiotropy. MiSTERI is shown to be particularly advantageous in the presence of pervasive heterogeneity of pleiotropic effects on the additive scale, a setting in which two recently proposed robust estimation methods, MR GxE and MR GENIUS, can be severely biased. For estimation, we propose a simple and consistent three-stage estimator that can be used as a preliminary estimator for a carefully constructed one-step update estimator, which is guaranteed to be more efficient under the assumed model. To incorporate multiple, possibly correlated and weak IVs, a common challenge in MR studies, we develop a MAny Weak Invalid Instruments (MR MaWII MiSTERI) approach for strengthened identification and improved accuracy. We have developed an R package, MR-MiSTERI, for public use of all proposed methods. We illustrate MR MiSTERI in an application using UK Biobank data to evaluate the causal relationship between body mass index and glucose, obtaining inferences that are robust to unmeasured confounding while leveraging many weak and potentially invalid candidate genetic IVs. MaWII MiSTERI is shown to be robust to horizontal pleiotropy, violation of the IV independence assumption, and weak IV bias. Both simulation studies and real data analysis results demonstrate the robustness of the proposed MR MiSTERI methods.
{"title":"On Mendelian Randomization Mixed-Scale Treatment Effect Robust Identification (MR MiSTERI) and Estimation for Causal Inference","authors":"Z. Liu, T. Ye, B. Sun, M. Schooling, E. T. Tchetgen Tchetgen","doi":"10.1101/2020.09.29.20204420","DOIUrl":"https://doi.org/10.1101/2020.09.29.20204420","url":null,"abstract":"Standard Mendelian randomization analysis can produce biased results if the genetic variant defining the instrumental variable (IV) is confounded and/or has a horizontal pleiotropic effect on the outcome of interest not mediated by the treatment. We provide novel identification conditions for the causal effect of a treatment in presence of unmeasured confounding, by leveraging an invalid IV for which both the IV independence and exclusion restriction assumptions may be violated. The proposed Mendelian randomization Mixed-Scale Treatment Effect Robust Identification (MR MiSTERI) approach relies on (i) an assumption that the treatment effect does not vary with the invalid IV on the additive scale; and (ii) that the selection bias due to confounding does not vary with the invalid IV on the odds ratio scale; and (iii) that the residual variance for the outcome is heteroscedastic and thus varies with the invalid IV. Although assumptions (i) and (ii) have, respectively appeared in the IV literature, assumption (iii) has not; we formally establish that their conjunction can identify a causal effect even with an invalid IV subject to pleiotropy. MiSTERI is shown to be particularly advantageous in presence of pervasive heterogeneity of pleiotropic effects on additive scale, a setting in which two recently proposed robust estimation methods MR GxE and MR GENIUS can be severely biased. For estimation, we propose a simple and consistent three-stage estimator that can be used as preliminary estimator to a carefully constructed one-step-update estimator, which is guaranteed to be more efficient under the assumed model. In order to incorporate multiple, possibly correlated and weak IVs, a common challenge in MR studies, we develop a MAny Weak Invalid Instruments (MR MaWII MiSTERI) approach for strengthened identification and improved accuracy. We have developed an R package MR-MiSTERI for public use of all proposed methods. We illustrate MR MiSTERI in an application using UK Biobank data to evaluate the causal relationship between body mass index and glucose, thus obtaining inferences that are robust to unmeasured confounding, leveraging many weak and potentially invalid candidate genetic IVs. MaWII MiSTERI is shown to be robust to horizontal pleiotropy, violation of IV independence assumption and weak IV bias. Both simulation studies and real data analysis results demonstrate the robustness of the proposed MR MiSTERI methods.","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126698660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}