An important step for any causal inference study design is understanding the distribution of the subjects in terms of measured baseline covariates. However, not all baseline variation is equally important. We propose a set of visualizations that reduce the space of measured covariates into two components of baseline variation important to the design of an observational causal inference study: a propensity score summarizing baseline variation associated with treatment assignment, and a prognostic score summarizing baseline variation associated with the untreated potential outcome. These assignment-control plots and variations thereof visualize study design trade-offs and illustrate core methodological concepts in causal inference. As a practical demonstration, we apply assignment-control plots to a hypothetical study of cardiothoracic surgery. To demonstrate how these plots can be used to illustrate nuanced concepts, we use them to visualize unmeasured confounding and to consider the relationship between propensity scores and instrumental variables. While the family of visualization tools for studies of causality is relatively sparse, simple visual tools can be an asset to education, application, and methods development.
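As a rough illustration of the idea (a sketch, not the authors' software), the following base-R example builds an assignment-control plot for simulated data with hypothetical covariates x1 and x2, a treatment indicator z, and an outcome y: the estimated propensity score comes from a logistic regression of treatment on the covariates, and the estimated prognostic score from a regression of the outcome fit among controls only and predicted for everyone.

    # Hypothetical simulated data: covariates x1, x2, treatment z, outcome y
    set.seed(1)
    n  <- 500
    x1 <- rnorm(n); x2 <- rnorm(n)
    z  <- rbinom(n, 1, plogis(0.8 * x1))
    y  <- 1 + x2 + 0.5 * z + rnorm(n)
    d  <- data.frame(x1, x2, z, y)

    # Propensity score: baseline variation associated with treatment assignment
    ps <- predict(glm(z ~ x1 + x2, family = binomial, data = d), type = "response")

    # Prognostic score: baseline variation associated with the untreated outcome,
    # estimated from the control group and predicted for all subjects
    prog <- predict(lm(y ~ x1 + x2, data = subset(d, z == 0)), newdata = d)

    # Assignment-control plot: prognostic score against propensity score
    plot(ps, prog, col = ifelse(d$z == 1, "red", "blue"),
         xlab = "Propensity score", ylab = "Prognostic score")
    legend("topleft", legend = c("treated", "control"),
           col = c("red", "blue"), pch = 1)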
In the paired data setting, the sign test is often described in statistical textbooks as a test for comparing the medians of two marginal distributions. There is an implicit assumption that the median of the differences is equivalent to the difference of the medians when employing the sign test in this fashion. We demonstrate, however, that given asymmetry in the bivariate distribution of the paired data, there are often scenarios where the median of the differences is not equal to the difference of the medians. Further, we show that these scenarios will lead to a false interpretation of the sign test for its intended use in the paired data setting. We illustrate this false-interpretation concept via theory, a simulation study, and a real-world example based on breast cancer RNA sequencing data obtained from The Cancer Genome Atlas (TCGA).
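The following simulation (a hypothetical construction, not the TCGA analysis) gives a concrete instance: an asymmetric, dependent pair of measurements for which the median of the differences and the difference of the marginal medians even have opposite signs.

    # Paired data where the median of the differences and the difference of the
    # marginal medians have opposite signs (hypothetical asymmetric dependence)
    set.seed(1)
    n <- 1e5
    x <- rnorm(n)
    # below-median x values tend to increase a lot; the rest decrease slightly
    d <- ifelse(x <= qnorm(0.45), 1, -0.1)
    y <- x + d

    median(y - x)              # about -0.10: most pairs decrease
    median(y) - median(x)      # positive (about +0.37): the marginal median increases

    # The sign test speaks to P(Y > X), i.e. the median of the differences,
    # and here points in the opposite direction from the difference of medians
    binom.test(sum(y > x), n)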
Posterior uncertainty is typically summarized as a credible interval, an interval in the parameter space that contains a fixed proportion - usually 95% - of the posterior probability. For multivariate parameters, credible sets perform the same role. There are, of course, many potential 95% intervals from which to choose, yet even standard choices are rarely justified in any formal way. In this paper we give a general method, focusing on the loss function that motivates an estimate - the Bayes rule - around which we construct a credible set. The set contains all points which, as estimates, would have minimally worse expected loss than the Bayes rule: we call this excess expected loss 'regret'. The approach can be used for any model and prior, and we show how it justifies all widely-used choices of credible interval/set. Further examples show how it provides insights into more complex estimation problems.
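To make the recipe concrete, here is a sketch under squared-error loss using simulated draws from a hypothetical gamma posterior: the Bayes rule is the posterior mean, the regret of a candidate estimate is its excess posterior expected loss, and the regret threshold is calibrated so that the resulting set carries 95% posterior probability.

    # Regret-based credible interval under squared-error loss, from posterior draws
    # (an illustrative sketch; 'theta' holds draws from a hypothetical posterior)
    set.seed(1)
    theta <- rgamma(1e5, shape = 3, rate = 1)

    bayes_rule <- mean(theta)                      # Bayes rule for squared-error loss
    exp_loss   <- function(t) mean((theta - t)^2)  # posterior expected loss of estimate t
    regret     <- function(t) exp_loss(t) - exp_loss(bayes_rule)

    # Collect candidate estimates whose regret is below a threshold, and calibrate
    # the threshold so that the resulting set carries 95% posterior probability
    grid <- seq(min(theta), max(theta), length.out = 2000)
    reg  <- sapply(grid, regret)
    mass <- function(r) {                          # posterior mass of the regret set
      keep <- range(grid[reg <= r])                # an interval under squared-error loss
      mean(theta >= keep[1] & theta <= keep[2])
    }
    r95 <- uniroot(function(r) mass(r) - 0.95, c(min(reg), max(reg)))$root
    range(grid[reg <= r95])                        # regret-based 95% credible interval

Under squared-error loss the regret set is an interval centred at the posterior mean; other loss functions change the shape of the set and lead to the other standard interval choices mentioned in the abstract.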
Health inequities are assessed by health departments to identify social groups disproportionately burdened by disease and by academic researchers to understand how social, economic, and environmental inequities manifest as health inequities. To characterize inequities, group-specific small-area health data are often modeled using log-linear generalized linear models (GLM) or generalized linear mixed models (GLMM) with a random intercept. These approaches estimate the same marginal rate ratio comparing disease rates across groups under standard assumptions. Here we explore how residential segregation combined with social group differences in disease risk can lead to contradictory findings from the GLM and GLMM. We show that this occurs because small-area disease rate data collected under these conditions induce endogeneity in the GLMM due to correlation between the model's offset and random effect. This results in GLMM estimates that represent conditional rather than marginal associations. We refer to endogeneity arising from the offset, which to our knowledge has not been noted previously, as "offset endogeneity". We illustrate this phenomenon in simulated data and real premature mortality data, and we propose alternative modeling approaches to address it. We also introduce to a statistical audience the social epidemiologic terminology for framing health inequities, which enables responsible interpretation of results.
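A small simulation can reproduce the mechanism described above (illustrative parameter values only; the GLMM fit assumes the lme4 package is available): residential segregation ties each area's group composition to its random effect, so the population offsets are correlated with the random intercepts and the GLM and GLMM rate-ratio estimates diverge.

    # Simulated small-area counts where segregation links group composition to area risk
    library(lme4)
    set.seed(1)
    n_area   <- 200
    b        <- rnorm(n_area, 0, 0.5)          # area-level random effects
    p_grp1   <- plogis(1.5 * b)                # segregation: group 1 share tied to area risk
    pop_area <- rep(1e4, n_area)
    d <- data.frame(
      area  = rep(seq_len(n_area), each = 2),
      group = rep(c(1, 0), n_area),
      pop   = as.vector(rbind(pop_area * p_grp1, pop_area * (1 - p_grp1))),
      b     = rep(b, each = 2)
    )
    rate <- exp(-6 + 0.5 * d$group + d$b)      # within-area group rate ratio = exp(0.5)
    d$y  <- rpois(nrow(d), d$pop * rate)

    # GLM recovers the marginal (crude) rate ratio, which includes the
    # compositional effect of segregation
    glm_fit  <- glm(y ~ group + offset(log(pop)), family = poisson, data = d)
    # GLMM: the offset is correlated with the random intercept ("offset endogeneity"),
    # and its estimate has a conditional, within-area interpretation
    glmm_fit <- glmer(y ~ group + (1 | area) + offset(log(pop)),
                      family = poisson, data = d)
    exp(coef(glm_fit)["group"]); exp(fixef(glmm_fit)["group"])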
Hamiltonian Monte Carlo (HMC) is a powerful tool for Bayesian computation. In comparison with the traditional Metropolis-Hastings algorithm, HMC offers greater computational efficiency, especially in higher-dimensional or more complex modeling situations. To most statisticians, however, the idea of HMC comes from a less familiar origin, one that is based on the theory of classical mechanics. Its implementation, either through Stan or one of its derivative programs, can appear opaque to beginners. A lack of understanding of the inner workings of HMC, in our opinion, has hindered its application to a broader range of statistical problems. In this article, we review the basic concepts of HMC in a language that is more familiar to statisticians, and we describe an HMC implementation in R, one of the most frequently used statistical software environments. We also present hmclearn, an R package for learning HMC. This package contains a general-purpose HMC function for data analysis. We illustrate the use of this package in common statistical models. In doing so, we hope to promote this powerful computational tool for wider use. Example code for common statistical models is presented as supplementary material for online publication.
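For readers who want to see the moving parts, the following is a bare-bones leapfrog HMC sampler written directly in base R; it is a textbook-style sketch of the algorithm, not the interface of the hmclearn package.

    # Minimal HMC: sample from a target with log-density log_p and gradient grad_log_p
    hmc_step <- function(theta, log_p, grad_log_p, eps, L) {
      p <- rnorm(length(theta))                    # draw auxiliary momentum
      theta_new <- theta; p_new <- p
      # leapfrog integration of the Hamiltonian dynamics
      p_new <- p_new + 0.5 * eps * grad_log_p(theta_new)
      for (l in seq_len(L)) {
        theta_new <- theta_new + eps * p_new
        if (l < L) p_new <- p_new + eps * grad_log_p(theta_new)
      }
      p_new <- p_new + 0.5 * eps * grad_log_p(theta_new)
      # Metropolis acceptance based on the change in total "energy"
      log_acc <- log_p(theta_new) - 0.5 * sum(p_new^2) -
                 (log_p(theta) - 0.5 * sum(p^2))
      if (log(runif(1)) < log_acc) theta_new else theta
    }

    # Example: standard bivariate normal target
    log_p      <- function(th) -0.5 * sum(th^2)
    grad_log_p <- function(th) -th
    draws <- matrix(NA, 2000, 2); draws[1, ] <- c(3, -3)
    for (i in 2:2000) {
      draws[i, ] <- hmc_step(draws[i - 1, ], log_p, grad_log_p, eps = 0.2, L = 20)
    }
    colMeans(draws)   # should be near zero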
Gaussian Markov random fields (GMRFs) are popular for modeling dependence in large areal datasets due to their ease of interpretation and the computational convenience afforded by their sparse precision matrices when generating random variables. Typically in Bayesian computation, GMRFs are updated jointly in a block Gibbs sampler or componentwise in a single-site sampler via the full conditional distributions. The former approach can speed convergence by updating correlated variables all at once, while the latter avoids solving large linear systems. We consider a sampling approach in which the underlying graph can be cut so that conditionally independent sites are updated simultaneously. This algorithm allows a practitioner to parallelize updates of subsets of locations or to take advantage of 'vectorized' calculations in a high-level language such as R. Through both simulated and real data, we demonstrate computational savings that can be achieved versus both single-site and block updating, regardless of whether the data are on a regular or an irregular lattice. The approach provides a good compromise between statistical and computational efficiency and is accessible to statisticians without expertise in numerical analysis or advanced computing.
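The base-R sketch below shows the idea on a regular lattice with a two-colour (checkerboard) cut of the graph, using an assumed proper CAR/GMRF model in which each site's full conditional is normal with mean rho times the neighbour average and variance sig2 divided by the number of neighbours; all sites of one colour are conditionally independent given the other colour and are drawn in a single vectorized step.

    # Chromatic ("checkerboard") Gibbs updates for a proper CAR/GMRF on an m x m lattice
    # (illustrative sketch with assumed parameters rho and sig2)
    set.seed(1)
    m <- 100; rho <- 0.9; sig2 <- 1
    x <- matrix(0, m, m)
    colour <- (row(x) + col(x)) %% 2              # 0/1 checkerboard colouring

    nbr_sum <- function(x) {                      # sum of the 4 lattice neighbours
      s <- matrix(0, m, m)
      s[-1, ] <- s[-1, ] + x[-m, ];  s[-m, ] <- s[-m, ] + x[-1, ]
      s[, -1] <- s[, -1] + x[, -m];  s[, -m] <- s[, -m] + x[, -1]
      s
    }
    n_nbr <- nbr_sum(matrix(1, m, m))             # number of neighbours per site

    for (iter in 1:100) {
      for (k in 0:1) {                            # update one colour class at a time
        mu  <- rho * nbr_sum(x) / n_nbr           # full conditional means
        sd_ <- sqrt(sig2 / n_nbr)                 # full conditional std. deviations
        idx <- colour == k
        x[idx] <- rnorm(sum(idx), mu[idx], sd_[idx])
      }
    }

For irregular lattices the same scheme applies, with a general graph colouring taking the place of the checkerboard.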
Sample size derivation is a crucial element of planning any confirmatory trial. The required sample size is typically derived based on constraints on the maximal acceptable Type I error rate and minimal desired power. Power depends on the unknown true effect and tends to be calculated either for the smallest relevant effect or a likely point alternative. The former might be problematic if the minimal relevant effect is close to the null, thus requiring an excessively large sample size, while the latter is dubious since it does not account for the a priori uncertainty about the likely alternative effect. A Bayesian perspective on sample size derivation for a frequentist trial can reconcile arguments about the relative a priori plausibility of alternative effects with ideas based on the relevance of effect sizes. Many suggestions as to how such "hybrid" approaches could be implemented in practice have been put forward. However, key quantities are often defined in subtly different ways in the literature. Starting from the traditional entirely frequentist approach to sample size derivation, we derive consistent definitions for the most commonly used hybrid quantities and highlight connections, before discussing and demonstrating their use in sample size derivation for clinical trials.
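As a simple illustration of one such hybrid quantity (with an assumed design prior and thresholds, not values from the paper), the sketch below contrasts a traditional sample size based on power at a point alternative with one based on expected power, i.e. the power curve averaged over a design prior on the standardized effect.

    # Sample size for a one-sided one-sample z-test at level alpha = 0.025
    alpha   <- 0.025
    power_n <- function(n, delta) pnorm(sqrt(n) * delta - qnorm(1 - alpha))

    # Traditional frequentist: power 0.8 at a single point alternative
    delta_point <- 0.3
    n_point <- min(which(sapply(1:500, power_n, delta = delta_point) >= 0.8))

    # Hybrid: expected power, averaging the power curve over a design prior
    # on the standardized effect, here N(0.3, 0.1^2)
    expected_power <- function(n) {
      integrate(function(d) power_n(n, d) * dnorm(d, 0.3, 0.1), -Inf, Inf)$value
    }
    n_hybrid <- min(which(sapply(1:500, expected_power) >= 0.8))

    c(point = n_point, hybrid = n_hybrid)   # the hybrid requirement is larger here

Accounting for a priori uncertainty about the effect typically shifts the required sample size relative to the point-alternative calculation, which is the trade-off the hybrid quantities are meant to make explicit.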
Personalized medicine asks whether a new treatment will help a particular patient, rather than whether it improves the average response in a population. Without a causal model to distinguish these questions, interpretational mistakes arise. These mistakes are seen in an article by Demidenko [2016] that recommends the "D-value," which is the probability that a randomly chosen person from the new-treatment group has a higher value for the outcome than a randomly chosen person from the control-treatment group. The abstract states "The D-value has a clear interpretation as the proportion of patients who get worse after the treatment" with similar assertions appearing later. We show these statements are incorrect because they require assumptions about the potential outcomes which are neither testable in randomized experiments nor plausible in general. The D-value will not equal the proportion of patients who get worse after treatment if (as expected) the potential outcomes are correlated. Independence of potential outcomes is unrealistic and eliminates any personalized treatment effects; with dependence, the D-value can imply that treatment is better than control even though most patients are harmed by the treatment. Thus, D-values are misleading for personalized medicine. To prevent misunderstandings, we advise incorporating causal models into basic statistics education.
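The gap between the two quantities is easy to see in a potential-outcomes simulation (hypothetical numbers, with higher outcomes taken as better): most patients are made worse off by treatment, yet the D-value, which compares independently drawn treated and control outcomes, suggests the treatment is beneficial.

    # Simulated potential outcomes: the D-value and the proportion of patients
    # who get worse after treatment can point in opposite directions
    set.seed(1)
    n  <- 1e5
    y0 <- rnorm(n)                           # outcome under control
    # treatment slightly harms 70% of patients and greatly benefits 30%
    effect <- ifelse(runif(n) < 0.7, -0.2, 2)
    y1 <- y0 + effect                        # outcome under treatment

    # Proportion of patients who actually get worse after treatment
    mean(y1 < y0)                            # about 0.70

    # D-value analogue: a randomly chosen treated outcome vs an independently
    # chosen control outcome
    mean(y1 > sample(y0))                    # about 0.60, i.e. treatment "looks better"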