This paper develops a method based on model-X knockoffs to find conditional associations that are consistent across environments, controlling the false discovery rate. The motivation for this problem is that large data sets may contain numerous associations that are statistically significant and yet misleading, as they are induced by confounders or sampling imperfections. However, associations replicated under different conditions may be more interesting. In fact, consistency sometimes provably leads to valid causal inferences even if conditional associations do not. While the proposed method is widely applicable, this paper highlights its relevance to genome-wide association studies, in which robustness across populations with diverse ancestries mitigates confounding due to unmeasured variants. The effectiveness of this approach is demonstrated by simulations and applications to the UK Biobank data.
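The multi-environment selection step can be sketched in a few lines. The sketch below assumes knockoff statistics have already been computed separately within each environment (for example, by lasso coefficient-difference statistics); the sign-consistency rule and signed minimum-magnitude combination are illustrative stand-ins rather than the paper's exact combining function, while the thresholding is the standard knockoff+ filter.

```python
import numpy as np

def knockoff_threshold(w, q=0.1):
    """Knockoff+ threshold: smallest t with (1 + #{w_j <= -t}) / #{w_j >= t} <= q."""
    for t in np.sort(np.abs(w[w != 0])):
        if (1 + np.sum(w <= -t)) / max(1, np.sum(w >= t)) <= q:
            return t
    return np.inf  # no feasible threshold: select nothing

def consistent_selections(W, q=0.1):
    """W: (n_environments, p) array of per-environment knockoff statistics.

    Illustrative consistency rule: keep a variable only if its statistic
    has the same (nonzero) sign in every environment, combining the
    statistics by the signed minimum magnitude before filtering.
    """
    signs = np.sign(W)
    consistent = np.all(signs == signs[0], axis=0) & (signs[0] != 0)
    w = np.where(consistent, signs[0] * np.abs(W).min(axis=0), 0.0)
    return np.flatnonzero(w >= knockoff_threshold(w, q))
```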
A general framework is set up to study the asymptotic properties of the intent-to-treat Wilcoxon-Mann-Whitney test in randomized experiments with nonignorable noncompliance. Under location-shift alternatives, the Pitman efficiencies of the intent-to-treat Wilcoxon-Mann-Whitney and t-tests are derived. It is shown that the former is superior if the compliers are more likely to be found in high-density regions of the outcome distribution or, equivalently, if the noncompliers tend to reside in the tails. By logical extension, the relative efficiency of the two tests is sharply bounded by least and most favourable scenarios in which the compliers are segregated into regions of lowest and highest density, respectively. Such bounds can be derived analytically as a function of the compliance rate for common location families such as Gaussian, Laplace, logistic and t distributions. These results can help empirical researchers choose the more efficient test for existing data, and calculate sample size for future trials in anticipation of noncompliance. Results for nonadditive alternatives and other tests follow along similar lines.
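For intuition, under full compliance the intent-to-treat tests coincide with the ordinary two-sample tests, so the efficiencies reduce to the classical Pitman efficiency of Wilcoxon-Mann-Whitney relative to the t-test, 12σ²(∫f²)² for a location family with density f and variance σ². A minimal numerical check of these classical values (not the paper's noncompliance bounds):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def are_wmw_vs_t(density, var):
    """Classical Pitman ARE of Wilcoxon-Mann-Whitney vs the two-sample
    t-test under a location shift: 12 * var * (integral of f^2)^2."""
    i2, _ = quad(lambda x: density(x) ** 2, -np.inf, np.inf)
    return 12.0 * var * i2 ** 2

print(are_wmw_vs_t(stats.norm.pdf, 1.0))               # ~0.955, i.e. 3/pi
print(are_wmw_vs_t(stats.logistic.pdf, np.pi**2 / 3))  # ~1.097, i.e. pi^2/9
print(are_wmw_vs_t(stats.laplace.pdf, 2.0))            # 1.5
```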
Zero-inflated nonnegative outcomes are common in many applications. In this work, motivated by freemium mobile game data, we propose a class of multiplicative structural nested mean models for zero-inflated nonnegative outcomes that flexibly describe the joint effect of a sequence of treatments in the presence of time-varying confounders. The proposed estimator solves a doubly robust estimating equation, in which the nuisance functions, namely the propensity score and the conditional outcome means given confounders, are estimated parametrically or nonparametrically. To improve accuracy, we exploit the zero-inflated structure of the outcomes by estimating the conditional means in two parts: separately modelling the probability of a positive outcome given confounders, and the mean outcome given that it is positive and given the confounders. We show that the proposed estimator is consistent and asymptotically normal as either the sample size or the follow-up time goes to infinity. Moreover, the usual sandwich formula consistently estimates the variance of the treatment effect estimators, without accounting for the variation due to estimating the nuisance functions. Simulation studies and an application to a freemium mobile game dataset demonstrate the empirical performance of the proposed method and support our theoretical findings.
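The two-part conditional-mean idea can be sketched directly. Assuming a numpy feature matrix X and zero-inflated outcome vector y, the hypothetical helper below fits a logistic model for P(Y > 0 | X) and a separate regression for E(Y | Y > 0, X), returning their product as the estimate of E(Y | X); the specific learners are illustrative choices, not those of the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

def fit_two_part_mean(X, y):
    """Two-part estimate of E[Y | X] for a zero-inflated nonnegative Y:
    P(Y > 0 | X) * E[Y | Y > 0, X]. Learner choices are illustrative."""
    pos = y > 0
    p_model = LogisticRegression(max_iter=1000).fit(X, pos)
    m_model = GradientBoostingRegressor().fit(X[pos], y[pos])
    def predict(X_new):
        return p_model.predict_proba(X_new)[:, 1] * m_model.predict(X_new)
    return predict
```

Usage is then a single call, e.g. predict = fit_two_part_mean(X, y) followed by predict(X_new); in the doubly robust estimating equation these fitted means play the role of one of the two nuisance functions.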
We propose a reinforcement learning method for estimating an optimal dynamic treatment regime for survival outcomes with dependent censoring. The estimator allows the failure time to be conditionally independent of censoring and dependent on the treatment decision times, supports a flexible number of treatment arms and treatment stages, and can maximize either the mean survival time or the survival probability at a given time-point. The estimator is constructed using generalized random survival forests and can attain polynomial rates of convergence. Simulations and an analysis of data from the Atherosclerosis Risk in Communities study suggest that the new estimator yields higher expected outcomes than existing methods in a variety of settings.
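The backward-induction structure of such estimators can be sketched as stage-wise Q-learning. The sketch below substitutes ordinary random-forest regression for the paper's generalized random survival forests and omits all censoring adjustments (for example, inverse-probability-of-censoring weights), so it illustrates only the recursion, not the proposed estimator.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_backward(stages):
    """stages: list ordered last stage first; each element (X, A, R) holds
    covariates, observed treatment arm and stage-wise reward (e.g. restricted
    survival time) for the same n subjects."""
    policies, value = [], 0.0
    for X, A, R in stages:
        pseudo = R + value                       # reward-to-go pseudo-outcome
        arms = np.unique(A)
        # Q(x, a): regress reward-to-go on covariates within each arm
        q = {a: RandomForestRegressor(n_estimators=200).fit(X[A == a], pseudo[A == a])
             for a in arms}
        Q = np.column_stack([q[a].predict(X) for a in arms])
        policies.append(lambda X_new, q=q, arms=arms:
                        arms[np.argmax(np.column_stack(
                            [q[a].predict(X_new) for a in arms]), axis=1)])
        value = Q.max(axis=1)                    # plug-in optimal value for the earlier stage
    return policies[::-1]                        # policies in chronological order
```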
Sparse principal component analysis is an important technique for simultaneous dimensionality reduction and variable selection with high-dimensional data. In this work we combine the unique geometric structure of the sparse principal component analysis problem with recent advances in convex optimization to develop novel gradient-based sparse principal component analysis algorithms. These algorithms enjoy the same global convergence guarantee as the original alternating direction method of multipliers, and can be more efficiently implemented with the rich toolbox developed for gradient methods from the deep learning literature. Most notably, these gradient-based algorithms can be combined with stochastic gradient descent methods to produce efficient online sparse principal component analysis algorithms with provable numerical and statistical performance guarantees. The practical performance and usefulness of the new algorithms are demonstrated in various simulation studies. As an application, we show how the scalability and statistical accuracy of our method enable us to find interesting functional gene groups in high-dimensional RNA sequencing data.
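As a point of reference, a single sparse component can be extracted by a simple gradient scheme: take a gradient ascent step on x'Ax, soft-threshold, and renormalize. This is a generic truncated-power-style sketch on a covariance matrix A, not the paper's ADMM-equivalent algorithm; the step size and penalty level are illustrative.

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_pc(A, lam=0.1, n_iter=500, seed=0):
    """One sparse principal component of a covariance matrix A via
    gradient ascent on x'Ax with l1 soft-thresholding and renormalization."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    step = 1.0 / np.linalg.norm(A, 2)            # scale step by the spectral norm
    for _ in range(n_iter):
        x = soft_threshold(x + step * (A @ x), step * lam)
        nrm = np.linalg.norm(x)
        if nrm == 0.0:                           # penalty too aggressive: all zeros
            break
        x /= nrm
    return x
```

Because each iteration is a matrix-vector product plus elementwise operations, the same update extends naturally to minibatch or stochastic variants, which is the scalability argument the abstract makes for the online setting.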