Revisiting Empirical Bayes Methods and Applications to Special Types of Data
Pub Date : 2021-06-29 DOI: 10.20381/RUOR-26562
XiuWen Duan
Empirical Bayes methods have been around for a long time and have a wide range of applications. They provide a way to aggregate historical data into estimates of the posterior mean. This thesis revisits several empirical Bayes methods and develops new applications. We first look at a linear empirical Bayes estimator and apply it to ranking and symbolic data. Next, we consider Tweedie's formula and show how it can be applied to analyze a microarray dataset. The application of the formula is simplified with the Pearson system of distributions. Saddlepoint approximations enable us to generalize several results in this direction. The results show that the proposed methods perform well in applications to real data sets.
Flexible Bayesian modelling of concomitant covariate effects in mixture models
Pub Date : 2021-05-26 DOI: 10.48676/UNIBO/AMSDOTTORATO/9861
Marco Berrettini, G. Galimberti, Saverio Ranciati, T. B. Murphy
Mixture models provide a useful tool to account for unobserved heterogeneity and are the basis of many model-based clustering methods. To gain additional flexibility, some model parameters can be expressed as functions of concomitant covariates. In particular, the prior probabilities of latent group membership can be linked to concomitant covariates through a multinomial logistic regression model, where each of these so-called component weights is associated with a linear predictor involving one or more of these variables. In this thesis, that approach is extended by replacing the linear predictors with additive ones, in which the contributions of some or all concomitant covariates are represented by smooth functions. An estimation procedure within the Bayesian paradigm is proposed. In particular, a data augmentation scheme based on difference random utility models is exploited, and the smoothness of the covariate effects is controlled by suitable choices for the prior distributions of the spline coefficients. The methodology is then extended to allow flexible covariate effects in the component densities as well. The performance of the proposed methodologies is investigated via simulation experiments and applications to real data.
The thesis is organized as follows. Chapter 1 reviews mixture models and mixture models with covariate effects. After a brief introduction to Bayesian additive models with P-splines, Chapter 2 presents the general specification of the proposed method, together with the associated Bayesian inference procedure. Chapters 3 and 4 adapt this approach to the specific cases of categorical and continuous manifest variables, respectively. Chapter 5 extends the methodology to include flexible covariate effects in the component densities. Finally, Chapter 6 collects conclusions and remarks.
A Critique of Differential Abundance Analysis, and Advocacy for an Alternative
Pub Date : 2021-04-14 DOI: 10.5281/ZENODO.4692004
Thomas P. Quinn, E. Gordon-Rodríguez, Ionas Erb
It is largely taken for granted that differential abundance analysis (DAA) is, by default, the best first step when analyzing genomic data. We argue that this is not necessarily the case. In this article, we identify key limitations that are intrinsic to differential abundance analysis: it is (a) dependent on unverifiable assumptions, (b) an unreliable construct, and (c) overly reductionist. We formulate an alternative framework, called ratio-based biomarker analysis, which does not suffer from the identified limitations. Moreover, ratio-based biomarkers are highly flexible. Beyond replacing DAA, they can also be used for many other bespoke analyses, including dimension reduction and multi-omics data integration.
Post-Processing of MCMC
Pub Date : 2021-03-30 DOI: 10.1146/ANNUREVSTATISTICS-040220-091727
Leah F. South, M. Riabiz, Onur Teymur, C. Oates
Markov chain Monte Carlo is the engine of modern Bayesian statistics, being used to approximate the posterior and derived quantities of interest. Despite this, the issue of how the output from a Markov chain is post-processed and reported is often overlooked. Convergence diagnostics can be used to control bias via burn-in removal, but these do not account for (common) situations where a limited computational budget engenders a bias-variance trade-off. The aim of this article is to review state-of-the-art techniques for post-processing Markov chain output. Our review covers methods based on discrepancy minimisation, which directly address the bias-variance trade-off, as well as general-purpose control variate methods for approximating expected quantities of interest.
{"title":"Post-Processing of MCMC","authors":"Leah F. South, M. Riabiz, Onur Teymur, C. Oates","doi":"10.1146/ANNUREVSTATISTICS-040220-091727","DOIUrl":"https://doi.org/10.1146/ANNUREVSTATISTICS-040220-091727","url":null,"abstract":"Markov chain Monte Carlo is the engine of modern Bayesian statistics, being used to approximate the posterior and derived quantities of interest. Despite this, the issue of how the output from a Markov chain is post-processed and reported is often overlooked. Convergence diagnostics can be used to control bias via burn-in removal, but these do not account for (common) situations where a limited computational budget engenders a bias-variance trade-off. The aim of this article is to review state-of-the-art techniques for post-processing Markov chain output. Our review covers methods based on discrepancy minimisation, which \u0000directly address the bias-variance trade-off, as well as general-purpose control variate methods for approximating expected quantities of interest.","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"318 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124295458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conditional variance estimator for sufficient dimension reduction
Pub Date : 2021-02-17 DOI: 10.3150/21-bej1402
L. Fertl, E. Bura
Conditional Variance Estimation (CVE) is a novel sufficient dimension reduction (SDR) method for additive error regressions with continuous predictors and link function. It operates under the assumption that the predictors can be replaced by a lower dimensional projection without loss of information. In contrast to the majority of moment-based sufficient dimension reduction methods, CVE is fully data driven, does not require the restrictive linearity and constant variance conditions, and is not based on inverse regression. CVE is shown to be consistent and its objective function to be uniformly convergent. CVE outperforms mean average variance estimation (MAVE), its main competitor, in several simulation settings, remains on par in others, and consistently outperforms the usual inverse-regression-based linear SDR methods, such as Sliced Inverse Regression.
The Violating Assumptions Series: Simulated demonstrations to illustrate how assumptions can affect statistical estimates
Pub Date : 2021-01-18 DOI: 10.13140/RG.2.2.13339.69921
Ian A. Silver
When teaching and discussing statistical assumptions, our focus is oftentimes placed on how to test and address potential violations rather than on the effects those violations have on the estimates produced by our statistical models. The latter represents a potential avenue to help us better understand the impact of researcher degrees of freedom on the statistical estimates we produce. The Violating Assumptions Series is an endeavor I have undertaken to demonstrate the effects of violating assumptions on the estimates produced across various statistical models. The series will review assumptions associated with estimating causal associations, as well as more complicated statistical models including, but not limited to, multilevel models, path models, structural equation models, and Bayesian models. In addition to this primary goal, the series of posts is designed to illustrate how simulations can be used to develop a comprehensive understanding of applied statistics.
{"title":"The Violating Assumptions Series: Simulated demonstrations to illustrate how assumptions can affect statistical estimates","authors":"Ian A. Silver","doi":"10.13140/RG.2.2.13339.69921","DOIUrl":"https://doi.org/10.13140/RG.2.2.13339.69921","url":null,"abstract":"When teaching and discussing statistical assumptions, our focus is oftentimes placed on how to test and address potential violations rather than the effects of violating assumptions on the estimates produced by our statistical models. The latter represents a potential avenue to help us better understand the impact of researcher degrees of freedom on the statistical estimates we produce. The Violating Assumptions Series is an endeavor I have undertaken to demonstrate the effects of violating assumptions on the estimates produced across various statistical models. The series will review assumptions associated with estimating causal associations, as well as more complicated statistical models including, but not limited to, multilevel models, path models, structural equation models, and Bayesian models. In addition to the primary goal, the series of posts is designed to illustrate how simulations can be used to develop a comprehensive understanding of applied statistics.","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126820360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Feature Weighted Mixed Naive Bayes Model for Monitoring Anomalies in the Fan System of a Thermal Power Plant
Pub Date : 2020-12-14 DOI: 10.1109/JAS.2020.000000
Min Wang, Li Sheng, Donghua Zhou, Maoyin Chen
With increasing intelligence and integration, a great number of two-valued variables (generally stored as 0/1 values) exist in large-scale industrial processes. However, these variables cannot be effectively handled by traditional monitoring methods such as linear discriminant analysis (LDA), principal component analysis (PCA), and partial least squares (PLS). Recently, a mixed hidden naive Bayesian model (MHNBM) was developed, for the first time, to utilize both two-valued and continuous variables for abnormality monitoring. Although MHNBM is effective, it still has shortcomings. In MHNBM, variables with greater correlation to other variables receive greater weights, which does not guarantee that greater weights are assigned to the more discriminating variables. In addition, the conditional probability must be computed from historical data; when training data are scarce, the conditional probability between continuous variables tends toward a uniform distribution, which degrades the performance of MHNBM. Here, a novel feature weighted mixed naive Bayes model (FWMNBM) is developed to overcome these shortcomings. In FWMNBM, variables that are more correlated with the class receive greater weights, so the more discriminating variables contribute more to the model. At the same time, FWMNBM does not have to calculate the conditional probability between variables, so it is less restricted by the number of training samples. Compared with MHNBM, FWMNBM has better performance; its effectiveness is validated on a numerical simulation example and a practical case from the Zhoushan thermal power plant (ZTPP), China.
A Generalized Heckman Model With Varying Sample Selection Bias and Dispersion Parameters
Pub Date : 2020-12-03 DOI: 10.5705/SS.202021.0068
F. D. S. Bastos, W. Barreto‐Souza, M. Genton
Many proposals have emerged as alternatives to the Heckman selection model, mainly to address the non-robustness of its normality assumption. The 2001 Medical Expenditure Panel Survey data are often used to illustrate this non-robustness. In this paper, we propose a generalization of the Heckman sample selection model that allows the sample selection bias and dispersion parameters to depend on covariates. We show that the non-robustness of the Heckman model may be due to the assumption of a constant sample selection bias parameter rather than to the normality assumption. Our methodology allows us to identify which covariates are important to explain the sample selection bias phenomenon, rather than only concluding whether it is present. We explore the inferential aspects of the maximum likelihood estimators (MLEs) for the proposed generalized Heckman model. More specifically, we show that the model satisfies regularity conditions ensuring consistency and asymptotic normality of the MLEs. Proper score residuals for sample selection models are provided, and model adequacy is addressed. Simulation results are presented to check the finite-sample behavior of the estimators and to verify the consequences of not allowing for varying sample selection bias and dispersion parameters. We show that the normality assumption is suitable for analyzing the medical expenditure data and that the conclusions drawn using our approach are coherent with findings from prior literature. Moreover, we identify which covariates are relevant to explain the presence of sample selection bias in this important dataset.
Double/debiased machine learning for logistic partially linear model
Pub Date : 2020-09-30 DOI: 10.1093/ECTJ/UTAB019
Molei Liu, Yi Zhang, D. Zhou
We propose double/debiased machine learning approaches to infer, at the parametric rate, the parametric component of a logistic partially linear model: the binary response follows a conditional logistic model whose predictor combines a low-dimensional linear parametric function of some key (exposure) covariates with a nonparametric function adjusting for the confounding effect of the other covariates. We consider a Neyman orthogonal (doubly robust) score equation involving two nuisance functions: the nonparametric component of the logistic model, and the conditional mean of the exposure given the other covariates with the response held fixed. To estimate the nuisance models, we separately consider the use of high dimensional (HD) sparse parametric models and more general (typically nonparametric) machine learning (ML) methods. In the HD case, we derive certain moment equations to calibrate the first-order bias of the nuisance models, granting our method a model double robustness property: our estimator achieves the desired rate when at least one of the nuisance models is correctly specified and both of them are ultra-sparse. In the ML case, the non-linearity of the logit link makes it substantially harder than in the partially linear setting to use an arbitrary conditional mean learning algorithm to estimate the nuisance component of the logistic model. We handle this obstacle through a novel full model refitting procedure that is easy to implement and facilitates the use of nonparametric ML algorithms in our framework. Our ML estimator is rate doubly robust in the same sense as Chernozhukov et al. (2018a). We evaluate our methods through simulation studies and apply them in assessing the effect of the emergency contraceptive (EC) pill on early gestation foetal death, exploiting a policy reform in Chile in 2008 (Bentancor and Clarke, 2017).
On Mendelian Randomization Mixed-Scale Treatment Effect Robust Identification (MR MiSTERI) and Estimation for Causal Inference
Pub Date : 2020-09-30 DOI: 10.1101/2020.09.29.20204420
Z. Liu, T. Ye, B. Sun, M. Schooling, E. T. Tchetgen Tchetgen
Standard Mendelian randomization analysis can produce biased results if the genetic variant defining the instrumental variable (IV) is confounded and/or has a horizontal pleiotropic effect on the outcome of interest not mediated by the treatment. We provide novel identification conditions for the causal effect of a treatment in the presence of unmeasured confounding, by leveraging an invalid IV for which both the IV independence and exclusion restriction assumptions may be violated. The proposed Mendelian randomization Mixed-Scale Treatment Effect Robust Identification (MR MiSTERI) approach relies on (i) an assumption that the treatment effect does not vary with the invalid IV on the additive scale; (ii) an assumption that the selection bias due to confounding does not vary with the invalid IV on the odds ratio scale; and (iii) an assumption that the residual variance for the outcome is heteroscedastic and thus varies with the invalid IV. Although assumptions (i) and (ii) have each appeared in the IV literature, assumption (iii) has not; we formally establish that their conjunction can identify a causal effect even with an invalid IV subject to pleiotropy. MiSTERI is shown to be particularly advantageous in the presence of pervasive heterogeneity of pleiotropic effects on the additive scale, a setting in which two recently proposed robust estimation methods, MR GxE and MR GENIUS, can be severely biased. For estimation, we propose a simple and consistent three-stage estimator that can be used as a preliminary estimator for a carefully constructed one-step update estimator, which is guaranteed to be more efficient under the assumed model. To incorporate multiple, possibly correlated and weak IVs, a common challenge in MR studies, we develop a MAny Weak Invalid Instruments (MR MaWII MiSTERI) approach for strengthened identification and improved accuracy. We have developed an R package, MR-MiSTERI, for public use of all proposed methods. We illustrate MR MiSTERI in an application using UK Biobank data to evaluate the causal relationship between body mass index and glucose, obtaining inferences that are robust to unmeasured confounding while leveraging many weak and potentially invalid candidate genetic IVs. MaWII MiSTERI is shown to be robust to horizontal pleiotropy, violation of the IV independence assumption, and weak IV bias. Both simulation studies and real data analysis results demonstrate the robustness of the proposed MR MiSTERI methods.
{"title":"On Mendelian Randomization Mixed-Scale Treatment Effect Robust Identification (MR MiSTERI) and Estimation for Causal Inference","authors":"Z. Liu, T. Ye, B. Sun, M. Schooling, E. T. Tchetgen Tchetgen","doi":"10.1101/2020.09.29.20204420","DOIUrl":"https://doi.org/10.1101/2020.09.29.20204420","url":null,"abstract":"Standard Mendelian randomization analysis can produce biased results if the genetic variant defining the instrumental variable (IV) is confounded and/or has a horizontal pleiotropic effect on the outcome of interest not mediated by the treatment. We provide novel identification conditions for the causal effect of a treatment in presence of unmeasured confounding, by leveraging an invalid IV for which both the IV independence and exclusion restriction assumptions may be violated. The proposed Mendelian randomization Mixed-Scale Treatment Effect Robust Identification (MR MiSTERI) approach relies on (i) an assumption that the treatment effect does not vary with the invalid IV on the additive scale; and (ii) that the selection bias due to confounding does not vary with the invalid IV on the odds ratio scale; and (iii) that the residual variance for the outcome is heteroscedastic and thus varies with the invalid IV. Although assumptions (i) and (ii) have, respectively appeared in the IV literature, assumption (iii) has not; we formally establish that their conjunction can identify a causal effect even with an invalid IV subject to pleiotropy. MiSTERI is shown to be particularly advantageous in presence of pervasive heterogeneity of pleiotropic effects on additive scale, a setting in which two recently proposed robust estimation methods MR GxE and MR GENIUS can be severely biased. For estimation, we propose a simple and consistent three-stage estimator that can be used as preliminary estimator to a carefully constructed one-step-update estimator, which is guaranteed to be more efficient under the assumed model. In order to incorporate multiple, possibly correlated and weak IVs, a common challenge in MR studies, we develop a MAny Weak Invalid Instruments (MR MaWII MiSTERI) approach for strengthened identification and improved accuracy. We have developed an R package MR-MiSTERI for public use of all proposed methods. We illustrate MR MiSTERI in an application using UK Biobank data to evaluate the causal relationship between body mass index and glucose, thus obtaining inferences that are robust to unmeasured confounding, leveraging many weak and potentially invalid candidate genetic IVs. MaWII MiSTERI is shown to be robust to horizontal pleiotropy, violation of IV independence assumption and weak IV bias. Both simulation studies and real data analysis results demonstrate the robustness of the proposed MR MiSTERI methods.","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126698660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}