How to analyze data when there is violation of the positivity assumption? Several possible solutions exist in the literature. In this paper, we consider propensity score (PS) methods that are commonly used in observational studies to assess causal treatment effects in the context where the positivity assumption is violated. We focus on and examine four specific alternative solutions to the inverse probability weighting (IPW) trimming and truncation: matching weight (MW), Shannon's entropy weight (EW), overlap weight (OW), and beta weight (BW) estimators.
We first specify their target population, the population of patients for whom clinical equipoise, that is, where we have sufficient PS overlap. Then, we establish the nexus among the different corresponding weights (and estimators); this allows us to highlight the shared properties and theoretical implications of these estimators. Finally, we introduce their augmented estimators that take advantage of estimating both the propensity score and outcome regression models to enhance the treatment effect estimators in terms of bias and efficiency. We also elucidate the role of the OW estimator as the flagship of all these methods that target the overlap population.
Our analytic results demonstrate that OW, MW, and EW are preferable to IPW and some cases of BW when there is a moderate or extreme (stochastic or structural) violation of the positivity assumption. We then evaluate, compare, and confirm the finite-sample performance of the aforementioned estimators via Monte Carlo simulations. Finally, we illustrate these methods using two real-world data examples marked by violations of the positivity assumption.
{"title":"Causal inference in the absence of positivity: The role of overlap weights","authors":"Roland A. Matsouaka, Yunji Zhou","doi":"10.1002/bimj.202300156","DOIUrl":"10.1002/bimj.202300156","url":null,"abstract":"<p>How to analyze data when there is violation of the positivity assumption? Several possible solutions exist in the literature. In this paper, we consider propensity score (PS) methods that are commonly used in observational studies to assess causal treatment effects in the context where the positivity assumption is violated. We focus on and examine four specific alternative solutions to the inverse probability weighting (IPW) trimming and truncation: matching weight (MW), Shannon's entropy weight (EW), overlap weight (OW), and beta weight (BW) estimators.</p><p>We first specify their target population, the population of patients for whom clinical equipoise, that is, where we have sufficient PS overlap. Then, we establish the nexus among the different corresponding weights (and estimators); this allows us to highlight the shared properties and theoretical implications of these estimators. Finally, we introduce their augmented estimators that take advantage of estimating both the propensity score and outcome regression models to enhance the treatment effect estimators in terms of bias and efficiency. We also elucidate the role of the OW estimator as the flagship of all these methods that target the overlap population.</p><p>Our analytic results demonstrate that OW, MW, and EW are preferable to IPW and some cases of BW when there is a moderate or extreme (stochastic or structural) violation of the positivity assumption. We then evaluate, compare, and confirm the finite-sample performance of the aforementioned estimators via Monte Carlo simulations. Finally, we illustrate these methods using two real-world data examples marked by violations of the positivity assumption.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141285482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Benjamin Planterose Jiménez, Manfred Kayser, Athina Vidaki, Amke Caliebe
Linear regression (LR) is vastly used in data analysis for continuous outcomes in biomedicine and epidemiology. Despite its popularity, LR is incompatible with missing data, which frequently occur in health sciences. For parameter estimation, this shortcoming is usually resolved by complete-case analysis or imputation. Both work-arounds, however, are inadequate for prediction, since they either fail to predict on incomplete records or ignore missingness-induced reduction in prediction accuracy and rely on (unrealistic) assumptions about the missing mechanism. Here, we derive adaptive predictor-set linear model (aps-lm), capable of making predictions for incomplete data without the need for imputation. It is derived by using a predictor-selection operation, the Moore–Penrose pseudoinverse, and the reduced QR decomposition. aps-lm is an LR generalization that inherently handles missing values. It is applied on a reference data set, where complete predictors and outcome are available, and yields a set of privacy-preserving parameters. In a second stage, these are shared for making predictions of the outcome on external data sets with missing entries for predictors without imputation. Moreover, aps-lm computes prediction errors that account for the pattern of missing values even under extreme missingness. We benchmark aps-lm in a simulation study. aps-lm showed greater prediction accuracy and reduced bias compared to popular imputation strategies under a wide range of scenarios including variation of sample size, goodness of fit, missing value type, and covariance structure. Finally, as a proof-of-principle, we apply aps-lm in the context of epigenetic aging clocks, linear models that predict a person's biological age from epigenetic data with promising clinical applications.
线性回归(LR)广泛应用于生物医学和流行病学中连续结果的数据分析。尽管线性回归很受欢迎,但它与缺失数据不兼容,而缺失数据在健康科学中经常出现。在参数估计中,这一缺陷通常通过完整案例分析或估算来解决。然而,这两种变通方法都不足以进行预测,因为它们要么无法对不完整的记录进行预测,要么忽略了缺失导致的预测准确性下降,并且依赖于对缺失机制的(不切实际的)假设。在这里,我们推导出了自适应预测集线性模型(aps-lm),它无需估算就能对不完整数据进行预测。它是通过使用预测器选择操作、摩尔-彭罗斯(Moore-Penrose)伪逆和还原 QR 分解得出的。它应用于参考数据集(其中有完整的预测因子和结果),并产生一组保护隐私的参数。在第二阶段,这些参数将被共享,用于对外部数据集的结果进行预测,外部数据集中的预测因子有缺失项,无需估算。此外,即使在极端缺失的情况下,aps-lm 也能计算出考虑到缺失值模式的预测误差。我们在模拟研究中对 aps-lm 进行了基准测试。与流行的估算策略相比,aps-lm 在样本量、拟合度、缺失值类型和协方差结构等多种情况下都显示出更高的预测准确性和更小的偏差。最后,作为原理验证,我们将 aps-lm 应用于表观遗传衰老时钟,这种线性模型可以从表观遗传数据中预测一个人的生物年龄,具有良好的临床应用前景。
{"title":"Adaptive predictor-set linear model: An imputation-free method for linear regression prediction on data sets with missing values","authors":"Benjamin Planterose Jiménez, Manfred Kayser, Athina Vidaki, Amke Caliebe","doi":"10.1002/bimj.202300090","DOIUrl":"10.1002/bimj.202300090","url":null,"abstract":"<p>Linear regression (LR) is vastly used in data analysis for continuous outcomes in biomedicine and epidemiology. Despite its popularity, LR is incompatible with missing data, which frequently occur in health sciences. For parameter estimation, this shortcoming is usually resolved by complete-case analysis or imputation. Both work-arounds, however, are inadequate for prediction, since they either fail to predict on incomplete records or ignore missingness-induced reduction in prediction accuracy and rely on (unrealistic) assumptions about the missing mechanism. Here, we derive adaptive predictor-set linear model (aps-lm), capable of making predictions for incomplete data without the need for imputation. It is derived by using a predictor-selection operation, the Moore–Penrose pseudoinverse, and the reduced QR decomposition. aps-lm is an LR generalization that inherently handles missing values. It is applied on a reference data set, where complete predictors and outcome are available, and yields a set of privacy-preserving parameters. In a second stage, these are shared for making predictions of the outcome on external data sets with missing entries for predictors without imputation. Moreover, aps-lm computes prediction errors that account for the pattern of missing values even under extreme missingness. We benchmark aps-lm in a simulation study. aps-lm showed greater prediction accuracy and reduced bias compared to popular imputation strategies under a wide range of scenarios including variation of sample size, goodness of fit, missing value type, and covariance structure. Finally, as a proof-of-principle, we apply aps-lm in the context of epigenetic aging clocks, linear models that predict a person's biological age from epigenetic data with promising clinical applications.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300090","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141177096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce a Bayesian approach for biclustering that accounts for the prior functional dependence between genes using hidden Markov models (HMMs). We utilize biological knowledge gathered from gene ontologies and the hidden Markov structure to capture the potential coexpression of neighboring genes. Our interpretable model-based clustering characterized each cluster of samples by three groups of features: overexpressed, underexpressed, and irrelevant features. The proposed methods have been implemented in an R package and are used to analyze both the simulated data and The Cancer Genome Atlas kidney cancer data.
我们介绍了一种贝叶斯双聚类方法,该方法利用隐马尔可夫模型(HMM)考虑了基因之间的先验功能依赖性。我们利用从基因本体和隐马尔可夫结构中收集的生物知识来捕捉相邻基因的潜在共表达。我们基于可解释模型的聚类方法通过三组特征来表征每个样本集群:过度表达、表达不足和无关特征。提出的方法已在 R 软件包中实现,并用于分析模拟数据和癌症基因组图谱肾癌数据。
{"title":"A Bayesian hierarchical hidden Markov model for clustering and gene selection: Application to kidney cancer gene expression data","authors":"Thierry Chekouo, Himadri Mukherjee","doi":"10.1002/bimj.202300173","DOIUrl":"10.1002/bimj.202300173","url":null,"abstract":"<p>We introduce a Bayesian approach for biclustering that accounts for the prior functional dependence between genes using hidden Markov models (HMMs). We utilize biological knowledge gathered from gene ontologies and the hidden Markov structure to capture the potential coexpression of neighboring genes. Our interpretable model-based clustering characterized each cluster of samples by three groups of features: overexpressed, underexpressed, and irrelevant features. The proposed methods have been implemented in an R package and are used to analyze both the simulated data and The Cancer Genome Atlas kidney cancer data.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300173","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141181467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In observational studies, instrumental variable (IV) methods are commonly applied when there are unmeasured covariates. In Mendelian randomization, constructing an allele score using many single nucleotide polymorphisms is often implemented; however, estimating biased causal effects by including some invalid IVs poses some risks. Invalid IVs are those IV candidates that are associated with unobserved variables. To solve this problem, we developed a novel strategy using negative control outcomes (NCOs) as auxiliary variables. Using NCOs, we are able to select only valid IVs and exclude invalid IVs without knowing which of the instruments are invalid. We also developed a new two-step estimation procedure and proved the semiparametric efficiency of our estimator. The performance of our proposed method was superior to some previous methods through simulations. Subsequently, we applied the proposed method to the UK Biobank dataset. Our results demonstrate that the use of an auxiliary variable, such as an NCO, enables the selection of valid IVs with assumptions different from those used in previous methods.
在观察性研究中,当存在无法测量的协变量时,通常会采用工具变量(IV)方法。在孟德尔随机化中,通常会使用许多单核苷酸多态性来构建等位基因得分;然而,通过包含一些无效的 IV 来估计有偏差的因果效应会带来一些风险。无效的 IV 是指那些与非观测变量相关的 IV 候选者。为了解决这个问题,我们开发了一种新策略,将负控制结果(NCOs)作为辅助变量。利用负控制结果,我们可以只选择有效的 IV,排除无效的 IV,而无需知道哪些工具是无效的。我们还开发了一种新的两步估计程序,并证明了我们的估计器的半参数效率。通过模拟,我们提出的方法的性能优于之前的一些方法。随后,我们将提出的方法应用于英国生物库数据集。我们的结果表明,使用辅助变量(如 NCO)可以选择有效的 IV,其假设条件与之前的方法不同。
{"title":"Valid instrumental variable selection method using negative control outcomes and constructing efficient estimator","authors":"Shunichiro Orihara, Atsushi Goto, Masataka Taguri","doi":"10.1002/bimj.202300113","DOIUrl":"10.1002/bimj.202300113","url":null,"abstract":"<p>In observational studies, instrumental variable (IV) methods are commonly applied when there are unmeasured covariates. In Mendelian randomization, constructing an allele score using many single nucleotide polymorphisms is often implemented; however, estimating biased causal effects by including some invalid IVs poses some risks. Invalid IVs are those IV candidates that are associated with unobserved variables. To solve this problem, we developed a novel strategy using negative control outcomes (NCOs) as auxiliary variables. Using NCOs, we are able to select only valid IVs and exclude invalid IVs without knowing which of the instruments are invalid. We also developed a new two-step estimation procedure and proved the semiparametric efficiency of our estimator. The performance of our proposed method was superior to some previous methods through simulations. Subsequently, we applied the proposed method to the UK Biobank dataset. Our results demonstrate that the use of an auxiliary variable, such as an NCO, enables the selection of valid IVs with assumptions different from those used in previous methods.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141155844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lutecia Servius, Davide Pigoli, Joseph Ng, Franca Fraternali
Statistical and machine learning methods have proved useful in many areas of immunology. In this paper, we address for the first time the problem of predicting the occurrence of class switch recombination (CSR) in B-cells, a problem of interest in understanding antibody response under immunological challenges. We propose a framework to analyze antibody repertoire data, based on clonal (CG) group representation in a way that allows us to predict CSR events using CG level features as input. We assess and compare the performance of several predicting models (logistic regression, LASSO logistic regression, random forest, and support vector machine) in carrying out this task. The proposed approach can obtain an unweighted average recall of