Using propensity scores for racial disparities analysis
Fan Li
Observational Studies 9(1): 59–68 (2022-09-08). DOI: 10.1353/obs.2023.0005
Abstract: The propensity score plays a central role in causal inference, but its use is not limited to causal comparisons. As a covariate balancing tool, the propensity score can also support controlled descriptive comparisons between groups whose memberships are not manipulable; a prominent example is racial disparities in health care. However, conceptual confusion and hesitation persist around using propensity scores in racial disparities studies. In this commentary, we argue that the propensity score, possibly combined with other methods, is an effective tool for racial disparities analysis. We describe the relevant estimands, target populations, and assumptions. In particular, we clarify that a controlled descriptive comparison requires weaker assumptions than a causal comparison. We discuss three common propensity score weighting strategies: overlap weighting, inverse probability weighting, and average treatment effect on the treated (ATT) weighting. We further describe how to combine weighting with the rank-and-replace adjustment method to produce racial disparity estimates concordant with the Institute of Medicine's definition. The method is illustrated by a re-analysis of the Medical Expenditure Panel Survey data.
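The three weighting strategies named in the abstract differ only in the weight assigned to each unit given its propensity score. A minimal numpy sketch with made-up propensity scores (the scores would in practice come from a fitted model; the `weighted_diff` helper is an illustrative addition, not a function from the paper):

```python
import numpy as np

# Illustrative data: group indicator z (1 = group of interest) and
# propensity scores e(x) = P(z = 1 | x), assumed already estimated.
z = np.array([1, 1, 0, 0])
e = np.array([0.8, 0.5, 0.5, 0.2])

# Inverse probability weighting (IPW): targets the combined population.
w_ipw = np.where(z == 1, 1.0 / e, 1.0 / (1.0 - e))

# ATT weighting: reweights the z = 0 group to resemble the z = 1 group.
w_att = np.where(z == 1, 1.0, e / (1.0 - e))

# Overlap weighting: emphasizes units whose group membership is most uncertain.
w_overlap = np.where(z == 1, 1.0 - e, e)

def weighted_diff(y, z, w):
    """Weighted mean difference between groups: the disparity estimate."""
    return (np.average(y[z == 1], weights=w[z == 1])
            - np.average(y[z == 0], weights=w[z == 0]))
```

Each weighting scheme implicitly targets a different population: IPW the combined population, ATT weighting the z = 1 group, and overlap weighting the subpopulation with the most overlap in covariates.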
Revisiting the Propensity Score's Central Role: Towards Bridging Balance and Efficiency in the Era of Causal Machine Learning
N. Hejazi, M. J. van der Laan
Observational Studies 9(1): 23–34 (2022-08-17). DOI: 10.1353/obs.2023.0001
Abstract: About forty years ago, in a now-seminal contribution, Rosenbaum and Rubin (1983) introduced a critical characterization of the propensity score as a central quantity for drawing causal inferences in observational study settings. In the decades since, much progress has been made across several research frontiers in causal inference, notably including the re-weighting and matching paradigms. Focusing on the former, and specifically on its intersection with machine learning and semiparametric efficiency theory, we re-examine the role of the propensity score in modern methodological developments. As the contribution of Rosenbaum and Rubin (1983) spurred a focus on the balancing property of the propensity score, we re-examine the degree to which, and how, this property plays a role in the development of asymptotically efficient estimators of causal effects; moreover, we discuss a connection between the balancing property and efficient estimation in the form of score equations, and propose a score test for evaluating whether an estimator achieves empirical balance.
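A common empirical diagnostic for the balancing property (a standard check, not the score test proposed in the paper) compares weighted covariate means across groups. A sketch:

```python
import numpy as np

def weighted_smd(X, z, w):
    """Weighted standardized mean difference for each column of X.

    X: (n, p) covariate matrix; z: binary group indicator; w: unit weights.
    Values near zero indicate empirical balance on that covariate.
    """
    m1 = np.average(X[z == 1], axis=0, weights=w[z == 1])
    m0 = np.average(X[z == 0], axis=0, weights=w[z == 0])
    # Pool unweighted group variances for the scale, a common convention.
    s = np.sqrt((X[z == 1].var(axis=0) + X[z == 0].var(axis=0)) / 2.0)
    return (m1 - m0) / s
```

In practice, absolute standardized differences below roughly 0.1 are often read as adequate balance, though the paper's point is precisely that such informal checks can be replaced by a formal score test.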
Propensity Score Modeling: Key Challenges When Moving Beyond the No-Interference Assumption
Hyunseung Kang, Chan Park, R. Trane
Observational Studies 9(1): 43–53 (2022-08-13). DOI: 10.1353/obs.2023.0003
Abstract: This paper presents several models for the propensity score. Considerable attention is given to a recently popular, but relatively under-explored, setting in causal inference where the no-interference assumption does not hold. We lay out some key challenges in propensity score modeling under interference and present a few promising models building on existing work on mixed-effects models.
Sensitivity Analysis for the Adjusted Mann-Whitney Test with Observational Studies
Maozhu Dai, Weining Shen, H. Stern
Observational Studies 8(1): 1–29 (2022-06-04). DOI: 10.1353/obs.2022.0002
Abstract: The Mann-Whitney test is a popular nonparametric test for comparing two samples. It was recently extended by Satten et al. (2018) to allow testing for the existence of treatment effects in observational studies. Their proposed adjusted Mann-Whitney test relies on the unconfoundedness assumption, which is untestable in practice. It is therefore important to assess the impact of violating this assumption on the degree to which causal conclusions remain valid. In this paper, we consider a marginal sensitivity analysis framework to address this problem, utilizing a bootstrap approach that provides a sensitivity interval for the estimand with a guaranteed coverage probability as long as the data generating mechanism is included in the set of pre-specified sensitivity models. We develop efficient optimization algorithms for computing the sensitivity interval and further extend our approach to a general class of adjusted multi-sample U-statistics. Simulation studies and two real data applications are discussed to demonstrate the utility of our proposed methodology.
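For reference, the unadjusted Mann-Whitney statistic counts pairs in which one sample exceeds the other; the paper's sensitivity interval brackets an adjusted version of such a statistic over a set of sensitivity models. A sketch of the unadjusted statistic with a plain percentile bootstrap interval (a simplification for intuition, not the paper's procedure):

```python
import numpy as np

def mann_whitney_u(x, y):
    """U statistic: number of pairs with x_i > y_j, counting ties as 1/2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return ((x[:, None] > y[None, :]).sum()
            + 0.5 * (x[:, None] == y[None, :]).sum())

def bootstrap_interval(x, y, level=0.95, n_boot=2000, seed=0):
    """Percentile bootstrap interval for the scaled statistic U / (n_x * n_y)."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        xb = rng.choice(x, size=len(x), replace=True)
        yb = rng.choice(y, size=len(y), replace=True)
        stats.append(mann_whitney_u(xb, yb) / (len(x) * len(y)))
    lo, hi = np.quantile(stats, [(1 - level) / 2, 1 - (1 - level) / 2])
    return float(lo), float(hi)
```

The scaled statistic U / (n_x n_y) estimates P(X > Y) + P(X = Y)/2, which equals 1/2 when the two distributions are exchangeable.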
Evaluation of Language Training Programs in Luxembourg using Principal Stratification
Michela Bia, Alfonso Flores-Lagunes, Andrea Mercatanti
Observational Studies 8(1): 1–44 (2022-06-04). DOI: 10.2139/ssrn.3538309
Abstract: In an increasingly globalized world, multiple language skills can create more employment opportunities. Several countries include language training in their active labor market programs for the unemployed. We analyze the effects of a language training program on the re-employment probability and hourly wages simultaneously, using high-quality administrative data from Luxembourg. We address selection into training with an unconfoundedness assumption and account for the complication that wages are "truncated" by unemployment by adopting a principal stratification framework. Estimation is undertaken with a likelihood-based mixture-model approach. To improve inference, we use the individual's hours worked as a secondary outcome together with a stochastic dominance assumption; these two features considerably ameliorate the multimodality problem commonly encountered in mixture models. We also conduct a sensitivity analysis to assess the unconfoundedness assumption. Our results suggest a positive effect (of up to 12.7 percent) of the language training program on the re-employment probability, but no effect on wages for those who are employed regardless of training participation.
gesttools: General Purpose G-Estimation in R
Daniel Tompsett, S. Vansteelandt, O. Dukes, B. D. De Stavola
Observational Studies 8(1): 1–28 (2022-06-04). DOI: 10.1353/obs.2022.0003
Abstract: In this paper we present gesttools, a set of general-purpose, user-friendly functions for performing g-estimation of structural nested mean models (SNMMs) for time-varying exposures and outcomes in R. The package implements the g-estimation methods of Vansteelandt and Sjolander (2016) and Dukes and Vansteelandt (2018), and is capable of analysing both end-of-study and time-varying outcome data that are either binary or continuous, with exposure variables that are binary, continuous, or categorical. It also allows for the fitting of SNMMs with time-varying causal effects, effect modification by other variables, or both, and supports censored data via inverse weighting. We outline the theory underpinning these methods and describe the SNMMs that can be fitted by the software. The package is demonstrated using simulated and real-world-inspired datasets.
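gesttools is an R package; purely for intuition, here is a Python sketch of g-estimation in the simplest textbook special case: a point exposure A with linear SNMM E[Y(a) − Y(0) | A = a, L] = ψa, where ψ solves the estimating equation Σᵢ (Yᵢ − ψAᵢ)(Aᵢ − E[A | Lᵢ]) = 0. This closed form is not the package's time-varying machinery, just the underlying idea:

```python
import numpy as np

def g_estimate_point(y, a, propensity):
    """Closed-form g-estimate of psi for a point-exposure linear SNMM.

    Solves sum_i (y_i - psi * a_i) * (a_i - e_i) = 0, where e_i = E[A | L_i]
    comes from an assumed correctly specified treatment model.
    """
    resid = a - propensity          # residual of the treatment model
    return float(np.sum(y * resid) / np.sum(a * resid))
```

With y generated exactly as y = 2a and a constant propensity, the estimator recovers ψ = 2; in real use, `propensity` would be fitted from covariates L.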
Causal Inference Challenges with Interrupted Time Series Designs: An Evaluation of an Assault Weapons Ban in California
R. Berk
Observational Studies (2022-04-26). DOI: 10.1353/obs.0.0001
The interrupted time series design was introduced to social scientists in 1963 by Campbell and Stanley; analysis methods were proposed by Box and Tiao in 1975, and more recent treatments are easily found (Box et al., 2016). Despite its popularity, current results in statistics reveal fundamental oversights in the standard statistical methods employed. The adaptive model selection built into recommended practice creates challenging problems for post-model-selection inference. What one might call model cherry-picking can invalidate conventional statistical inference, including statistical tests and confidence intervals, with damaging consequences for causal inference. There are technical developments that can correct for these problems, but these remedies raise conceptual difficulties for causal inference once proper estimands are defined. The issues are illustrated with an analysis of the impact of an assault weapons ban on daily handgun sales in California from 1996 through 2018. Statistically valid regression functionals are obtained, but their causal meaning is unclear. Researchers might be best served by interpreting only the sign of such functionals.
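The standard ITS analysis under scrutiny here is a segmented regression: a level and trend, plus post-intervention changes in both. A minimal OLS sketch (illustrative only, not the paper's analysis of the California data):

```python
import numpy as np

def segmented_regression(y, t, t0):
    """OLS fit of y ~ 1 + t + post + post*(t - t0), with post = 1{t >= t0}.

    Returns coefficients (baseline level, baseline trend,
    post-intervention level change, post-intervention trend change).
    """
    post = (t >= t0).astype(float)
    X = np.column_stack([np.ones_like(t, dtype=float), t, post, post * (t - t0)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta
```

The paper's caution applies directly: if this model specification was itself chosen after inspecting the same series, the usual tests and confidence intervals for these coefficients are no longer valid.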
Posterior Predictive Propensity Scores and p-Values
Peng Ding, Tianyu Guo
Observational Studies 9(1): 3–18 (2022-02-16). DOI: 10.1353/obs.2023.0015
Abstract: Rosenbaum and Rubin (1983) introduced the notion of the propensity score and discussed its central role in causal inference with observational studies. Their paper, however, created a fundamental incoherence with an earlier paper by Rubin (1978), which showed that the propensity score plays no role in the Bayesian analysis of unconfounded observational studies if the priors on the propensity score and outcome models are independent. Despite serious efforts in the literature, it is generally difficult to reconcile these contradicting results. We offer a simple approach to incorporating the propensity score in Bayesian causal inference based on the posterior predictive p-value. To motivate a simple procedure, we focus on the model with the strong null hypothesis of no causal effects for any units whatsoever. Computationally, the proposed posterior predictive p-value equals the classic p-value based on the Fisher randomization test averaged over the posterior predictive distribution of the propensity score. Moreover, using the studentized doubly robust estimator as the test statistic, the proposed p-value inherits the doubly robust property and is also asymptotically valid for testing the weak null hypothesis of zero average causal effect. Perhaps surprisingly, this Bayesian-motivated p-value can have better frequentist finite-sample performance than the frequentist p-value based on the asymptotic approximation, especially when the propensity scores take extreme values.
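The computational recipe in the abstract — a Fisher randomization test p-value averaged over posterior draws of the propensity score — can be sketched as follows. The difference in means stands in for the paper's studentized doubly robust statistic, and posterior draws `e_draws` are assumed to be supplied by the caller:

```python
import numpy as np

def frt_pvalue(y, z, e, n_draws=1000, seed=0):
    """FRT p-value under the sharp null; assignments redrawn as Bernoulli(e)."""
    rng = np.random.default_rng(seed)
    obs = y[z == 1].mean() - y[z == 0].mean()
    hits, total = 0, 0
    for _ in range(n_draws):
        zs = rng.binomial(1, e)
        if 0 < zs.sum() < len(zs):          # need both groups non-empty
            stat = y[zs == 1].mean() - y[zs == 0].mean()
            hits += abs(stat) >= abs(obs)
            total += 1
    return hits / total

def posterior_predictive_pvalue(y, z, e_draws, **kw):
    """Average the FRT p-value over posterior draws of the propensity score."""
    return float(np.mean([frt_pvalue(y, z, e, **kw) for e in e_draws]))
```

Each posterior draw of the propensity score yields one FRT p-value; averaging over draws propagates uncertainty about the assignment mechanism into the final p-value.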
Melting together prediction and inference
A. Daoud, Devdatt P. Dubhashi
Observational Studies 7(1): 1–7 (2021-10-04). DOI: 10.1353/obs.2021.0035
Abstract: In Leo Breiman's influential article "Statistical Modeling: The Two Cultures," he identified two cultures of statistical practice. The data modeling culture (DMC) denotes practices tailored for statistical inference targeting a quantity of interest, [inline-graphic 01]. The algorithmic modeling culture (AMC) refers to practices defining an algorithm, or machine-learning (ML) procedure, that generates accurate predictions about an outcome of interest, [inline-graphic 02]. Although DMC was the dominant mode, Breiman argued that statisticians should give more attention to AMC. Twenty years later, energized by two revolutions, one in data science and one in causal inference, a hybrid modeling culture (HMC) is rising. HMC fuses the inferential strength of DMC with the predictive power of AMC toward the goal of analyzing cause and effect; thus, HMC's quantity of interest is the causal effect, [inline-graphic 03]. In combining inference and prediction, the result of HMC practices is that the distinction between prediction and inference, taken to its limit, melts away. While this hybrid culture is not yet the default mode of scientific practice, we argue that it offers an intriguing novel path for the applied sciences.
Randomization Tests to Assess Covariate Balance When Designing and Analyzing Matched Datasets
Zach Branson
Observational Studies 7(1): 1–36 (2021-09-09). DOI: 10.1353/obs.2021.0031
Abstract: Causal analyses for observational studies are often complicated by covariate imbalances among treatment groups, and matching methodologies alleviate this complication by finding subsets of treatment groups that exhibit covariate balance. It is widely agreed that covariate balance can serve as evidence that a matched dataset approximates a randomized experiment, but what kind of experiment does a matched dataset approximate? In this work, we develop a randomization test for the hypothesis that a matched dataset approximates a particular experimental design, such as complete randomization, block randomization, or rerandomization. Our test can incorporate any experimental design, and it allows for a graphical display that puts several designs on the same univariate scale, thereby letting researchers pinpoint which design, if any, is most appropriate for a matched dataset. After researchers determine a plausible design, we recommend a randomization-based approach for analyzing the matched data, which can incorporate any design and treatment effect estimator. Through simulation, we find that our test can frequently detect violations of randomized assignment that harm inferential results. Furthermore, through simulation and a real application in political science, we find that matched datasets with high levels of covariate balance tend to approximate balance-constrained designs like rerandomization, and analyzing them as such can lead to precise causal analyses. However, researchers should proceed with caution when assuming a precise design, because doing so can harm inferential results if substantial biases remain due to residual imbalances after matching. Our approach is implemented in the randChecks R package, available on CRAN.
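The basic ingredient of such a randomization test is comparing an observed balance statistic, such as the Mahalanobis distance between group covariate means, with its distribution under a hypothesized design. A sketch for complete randomization (an illustrative reimplementation of the idea, not the randChecks API, which also covers block randomization and rerandomization):

```python
import numpy as np

def mahalanobis_balance(X, z):
    """Mahalanobis distance between treated and control covariate means."""
    d = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
    S = np.cov(X, rowvar=False)
    return float(d @ np.linalg.solve(S, d))

def balance_pvalue(X, z, n_perm=1000, seed=0):
    """P(distance >= observed) under complete randomization (permuted z)."""
    rng = np.random.default_rng(seed)
    obs = mahalanobis_balance(X, z)
    perm = [mahalanobis_balance(X, rng.permutation(z)) for _ in range(n_perm)]
    return float(np.mean([p >= obs for p in perm]))
```

A matched dataset whose observed distance is far smaller than typical complete-randomization draws yields a p-value near 1, suggesting it better approximates a balance-constrained design such as rerandomization; repeating the comparison under several hypothesized designs gives the univariate graphical display described above.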