Pub Date: 2013-01-01. DOI: 10.1080/00031305.2013.842498
Yeyi Zhu, Ladia M Hernandez, Peter Mueller, Yongquan Dong, Michele R Forman
The aim of this paper is to address issues in research that may be missing from statistics classes yet are important for (bio-)statistics students. In the context of a case study, we discuss data acquisition and preprocessing steps that fill the gap between research questions posed by subject-matter scientists and statistical methodology for formal inference. Issues include participant recruitment, data collection training and standardization, variable coding, data review and verification, data cleaning and editing, and documentation. Despite the critical importance of these details in research, most of these issues are rarely discussed in an applied statistics program. One reason for the lack of more formal training is the difficulty of systematically addressing the many challenges that can arise in the course of a study. This article helps to bridge the gap between research questions and formal statistical inference through discussion of an illustrative case study. We hope that reading and discussing this paper and practicing data preprocessing exercises will sensitize statistics students to these important issues and help them achieve optimal conduct, quality control, analysis, and interpretation of a study.
{"title":"Data Acquisition and Preprocessing in Studies on Humans: What Is Not Taught in Statistics Classes?","authors":"Yeyi Zhu, Ladia M Hernandez, Peter Mueller, Yongquan Dong, Michele R Forman","doi":"10.1080/00031305.2013.842498","DOIUrl":"10.1080/00031305.2013.842498","url":null,"abstract":"<p><p>The aim of this paper is to address issues in research that may be missing from statistics classes and important for (bio-)statistics students. In the context of a case study, we discuss data acquisition and preprocessing steps that fill the gap between research questions posed by subject matter scientists and statistical methodology for formal inference. Issues include participant recruitment, data collection training and standardization, variable coding, data review and verification, data cleaning and editing, and documentation. Despite the critical importance of these details in research, most of these issues are rarely discussed in an applied statistics program. One reason for the lack of more formal training is the difficulty in addressing the many challenges that can possibly arise in the course of a study in a systematic way. This article can help to bridge this gap between research questions and formal statistical inference by using an illustrative case study for a discussion. We hope that reading and discussing this paper and practicing data preprocessing exercises will sensitize statistics students to these important issues and achieve optimal conduct, quality control, analysis, and interpretation of a study.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"67 4","pages":"235-241"},"PeriodicalIF":1.8,"publicationDate":"2013-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3912269/pdf/nihms537499.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32104198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
At present, there are many software procedures available that enable statisticians to fit linear mixed models (LMMs) to continuous dependent variables in clustered or longitudinal datasets. LMMs are flexible tools for analyzing relationships among variables in these types of datasets, in that a variety of covariance structures can be used depending on the subject matter under study. The explicit random effects in LMMs allow analysts to make inferences about the variability between clusters or subjects in larger hypothetical populations, and examine cluster- or subject-level variables that explain portions of this variability. These models can also be used to analyze longitudinal or clustered datasets with data that are missing at random (MAR), and can accommodate time-varying covariates in longitudinal datasets. Although the software procedures currently available have many features in common, more specific analytic aspects of fitting LMMs (e.g., crossed random effects, appropriate hypothesis testing for variance components, diagnostics, incorporating sampling weights) may only be available in selected software procedures. With this article, we aim to perform a comprehensive and up-to-date comparison of the current capabilities of software procedures for fitting LMMs, and provide statisticians with a guide for selecting a software procedure appropriate for their analytic goals.
{"title":"An Overview of Current Software Procedures for Fitting Linear Mixed Models.","authors":"Brady T West, Andrzej T Galecki","doi":"10.1198/tas.2011.11077","DOIUrl":"https://doi.org/10.1198/tas.2011.11077","url":null,"abstract":"At present, there are many software procedures available that enable statisticians to fit linear mixed models (LMMs) to continuous dependent variables in clustered or longitudinal datasets. LMMs are flexible tools for analyzing relationships among variables in these types of datasets, in that a variety of covariance structures can be used depending on the subject matter under study. The explicit random effects in LMMs allow analysts to make inferences about the variability between clusters or subjects in larger hypothetical populations, and examine cluster- or subject-level variables that explain portions of this variability. These models can also be used to analyze longitudinal or clustered datasets with data that are missing at random (MAR), and can accommodate time-varying covariates in longitudinal datasets. Although the software procedures currently available have many features in common, more specific analytic aspects of fitting LMMs (e.g., crossed random effects, appropriate hypothesis testing for variance components, diagnostics, incorporating sampling weights) may only be available in selected software procedures. With this article, we aim to perform a comprehensive and up-to-date comparison of the current capabilities of software procedures for fitting LMMs, and provide statisticians with a guide for selecting a software procedure appropriate for their analytic goals.","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"65 4","pages":"274-282"},"PeriodicalIF":1.8,"publicationDate":"2012-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1198/tas.2011.11077","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31375746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-01-01. Epub Date: 2012-03-21. DOI: 10.1080/00031305.2012.676329
Bailey K Fosdick, Adrian E Raftery
We consider the problem of estimating the correlation in bivariate normal data when the means and variances are assumed known, with emphasis on the small sample case. We consider eight different estimators, several of them considered here for the first time in the literature. In a simulation study, we found that Bayesian estimators using the uniform and arc-sine priors outperformed several empirical and exact or approximate maximum likelihood estimators in small samples. The arc-sine prior did better for large values of the correlation. For testing whether the correlation is zero, we found that Bayesian hypothesis tests outperformed significance tests based on the empirical and exact or approximate maximum likelihood estimators considered in small samples, but that all tests performed similarly for sample size 50. These results lead us to suggest using the posterior mean with the arc-sine prior to estimate the correlation in small samples when the variances are assumed known.
{"title":"Estimating the Correlation in Bivariate Normal Data with Known Variances and Small Sample Sizes().","authors":"Bailey K Fosdick, Adrian E Raftery","doi":"10.1080/00031305.2012.676329","DOIUrl":"https://doi.org/10.1080/00031305.2012.676329","url":null,"abstract":"<p><p>We consider the problem of estimating the correlation in bivariate normal data when the means and variances are assumed known, with emphasis on the small sample case. We consider eight different estimators, several of them considered here for the first time in the literature. In a simulation study, we found that Bayesian estimators using the uniform and arc-sine priors outperformed several empirical and exact or approximate maximum likelihood estimators in small samples. The arc-sine prior did better for large values of the correlation. For testing whether the correlation is zero, we found that Bayesian hypothesis tests outperformed significance tests based on the empirical and exact or approximate maximum likelihood estimators considered in small samples, but that all tests performed similarly for sample size 50. These results lead us to suggest using the posterior mean with the arc-sine prior to estimate the correlation in small samples when the variances are assumed known.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"66 1","pages":"34-41"},"PeriodicalIF":1.8,"publicationDate":"2012-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/00031305.2012.676329","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31302836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-01-01. DOI: 10.1080/00031305.2012.720900
Theodore G Karrison, Mark J Ratain, Walter M Stadler, Gary L Rosner
The randomized discontinuation trial (RDT) design is an enrichment-type design that has been used in a variety of diseases to evaluate the efficacy of new treatments. The RDT design seeks to select a more homogeneous group of patients, consisting of those who are more likely to show a treatment benefit if one exists. In oncology, the RDT design has been applied to evaluate the effects of cytostatic agents, that is, drugs that act primarily by slowing tumor growth rather than shrinking tumors. In the RDT design, all patients receive treatment during an initial, open-label run-in period of duration T. Patients with objective response (substantial tumor shrinkage) remain on therapy while those with early progressive disease are removed from the trial. Patients with stable disease (SD) are then randomized either to continue active treatment or to switch to placebo. The main analysis compares outcomes, for example, progression-free survival (PFS), between the two randomized arms. As a secondary objective, investigators may seek to estimate PFS for all treated patients, measured from the time of entry into the study, by combining information from the run-in and post-run-in periods. For t ≤ T, PFS is estimated by the observed proportion of patients who are progression-free among all patients enrolled. For t > T, the estimate can be expressed as Ŝ(t) = p̂OR × ŜOR(t - T) + p̂SD × ŜSD(t - T), where p̂OR is the estimated probability of response during the run-in period, p̂SD is the estimated probability of SD, and ŜOR(t - T) and ŜSD(t - T) are the Kaplan-Meier estimates of subsequent PFS in the responders and in the patients with SD randomized to continue treatment, respectively. In this article, we derive the variance of Ŝ(t), enabling the construction of confidence intervals for both S(t) and the median survival time. Simulation results indicate that the method provides accurate coverage rates. An interesting aspect of the design is that outcomes during the run-in phase have a negative multinomial distribution, something not frequently encountered in practice.
{"title":"Estimation of Progression-Free Survival for All Treated Patients in the Randomized Discontinuation Trial Design.","authors":"Theodore G Karrison, Mark J Ratain, Walter M Stadler, Gary L Rosner","doi":"10.1080/00031305.2012.720900","DOIUrl":"https://doi.org/10.1080/00031305.2012.720900","url":null,"abstract":"<p><p>The randomized discontinuation trial (RDT) design is an enrichment-type design that has been used in a variety of diseases to evaluate the efficacy of new treatments. The RDT design seeks to select a more homogeneous group of patients, consisting of those who are more likely to show a treatment benefit if one exists. In oncology, the RDT design has been applied to evaluate the effects of cytostatic agents, that is, drugs that act primarily by slowing tumor growth rather than shrinking tumors. In the RDT design, all patients receive treatment during an initial, open-label run-in period of duration <i>T</i>. Patients with objective response (substantial tumor shrinkage) remain on therapy while those with early progressive disease are removed from the trial. Patients with stable disease (SD) are then randomized to either continue active treatment or switched to placebo. The main analysis compares outcomes, for example, progression-free survival (PFS), between the two randomized arms. As a secondary objective, investigators may seek to estimate PFS for all treated patients, measured from the time of entry into the study, by combining information from the run-in and post run-in periods. For <i>t ≤ T</i>, PFS is estimated by the observed proportion of patients who are progression-free among all patients enrolled. For <i>t > T</i>, the estimate can be expressed as <i>Ŝ</i>(<i>t</i>) = <i>p̂</i><sub>OR</sub> × <i>Ŝ</i><sub>OR</sub>(<i>t - T</i>) + <i>p̂</i><sub>SD</sub> × <i>Ŝ</i><sub>SD</sub>(<i>t - T</i>), where <i>p̂</i><sub>OR</sub> is the estimated probability of response during the run-in period, <i>p̂</i><sub>SD</sub> is the estimated probability of SD, and <i>Ŝ</i><sub>OR</sub>(<i>t - T</i>) and <i>Ŝ</i><sub>SD</sub>(<i>t - T</i>) are the Kaplan-Meier estimates of subsequent PFS in the responders and patients with SD randomized to continue treatment, respectively. In this article, we derive the variance of <i>Ŝ</i>(<i>t</i>), enabling the construction of confidence intervals for both <i>S</i>(<i>t</i>) and the median survival time. Simulation results indicate that the method provides accurate coverage rates. An interesting aspect of the design is that outcomes during the run-in phase have a negative multinomial distribution, something not frequently encountered in practice.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"66 3","pages":"155-162"},"PeriodicalIF":1.8,"publicationDate":"2012-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/00031305.2012.720900","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31736474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-01-01. DOI: 10.1080/00031305.2012.703873
Mehmet Kocak, Arzu Onar-Thomas
Cox proportional hazards (PH) models are commonly used in medical research to investigate the associations between covariates and time-to-event outcomes. It is frequently noted that with fewer than ten events per covariate, these models produce spurious results and therefore should not be used. The statistical literature contains asymptotic power formulae for the Cox model that can be used to determine the number of events needed to detect an association. Here we investigate via simulations the performance of these formulae in small-sample settings for Cox models with one or two covariates. Our simulations indicate that, when the number of events is small, the power estimate based on the asymptotic formulae is often inflated. The discrepancy between the asymptotic and empirical power is larger for the dichotomous covariate, especially when the allocation of sample size to its levels is unequal. When more than one covariate is included in the same model, the discrepancy between the asymptotic power and the empirical power is even larger, especially when a high positive correlation exists between the two covariates.
{"title":"A Simulation Based Evaluation of the Asymptotic Power Formulae for Cox Models in Small Sample Cases.","authors":"Mehmet Kocak, Arzu Onar-Thomas","doi":"10.1080/00031305.2012.703873","DOIUrl":"https://doi.org/10.1080/00031305.2012.703873","url":null,"abstract":"<p><p>Cox proportional hazards (PH) models are commonly used in medical research to investigate the associations between covariates and time to event outcomes. It is frequently noted that with less than ten events per covariate, these models produce spurious results, and therefore, should not be used. Statistical literature contains asymptotic power formulae for the Cox model which can be used to determine the number of events needed to detect an association. Here we investigate via simulations the performance of these formulae in small sample settings for Cox models with 1- or 2-covariates. Our simulations indicate that, when the number of events is small, the power estimate based on the asymptotic formulae is often inflated. The discrepancy between the asymptotic and empirical power is larger for the dichotomous covariate especially in cases where allocation of sample size to its levels is unequal. When more than one covariate is included in the same model, the discrepancy between the asymptotic power and the empirical power is even larger, especially when a high positive correlation exists between the two covariates.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"66 3","pages":"173-179"},"PeriodicalIF":1.8,"publicationDate":"2012-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/00031305.2012.703873","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31798842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-01-01. Epub Date: 2012-06-12. DOI: 10.1080/00031305.2012.671724
Robert S Poulson, Gary L Gadbury, David B Allison
The plausibility of high variability in treatment effects across individuals has been recognized as an important consideration in clinical studies. Surprisingly, little attention has been given to evaluating this variability in the design of clinical trials or in analyses of the resulting data. High variation in a treatment's efficacy or safety across individuals (referred to herein as treatment heterogeneity) may have important consequences because the optimal treatment choice for an individual may be different from that suggested by a study of average effects. We call this an individual qualitative interaction (IQI), borrowing terminology from earlier work referring to a qualitative interaction (QI) as being present when the optimal treatment varies across "groups" of individuals. At least three techniques have been proposed to investigate treatment heterogeneity: techniques to detect a QI, use of measures such as the density overlap of two outcome variables under different treatments, and use of cross-over designs to observe "individual effects." We elucidate underlying connections among them, their limitations, and some assumptions that may be required. We do so under a potential outcomes framework that can add insights to results from usual data analyses and to study design features that improve the capability to more directly assess treatment heterogeneity.
{"title":"Treatment Heterogeneity and Individual Qualitative Interaction.","authors":"Robert S Poulson, Gary L Gadbury, David B Allison","doi":"10.1080/00031305.2012.671724","DOIUrl":"https://doi.org/10.1080/00031305.2012.671724","url":null,"abstract":"<p><p>Plausibility of high variability in treatment effects across individuals has been recognized as an important consideration in clinical studies. Surprisingly, little attention has been given to evaluating this variability in design of clinical trials or analyses of resulting data. High variation in a treatment's efficacy or safety across individuals (referred to herein as treatment heterogeneity) may have important consequences because the optimal treatment choice for an individual may be different from that suggested by a study of average effects. We call this an individual qualitative interaction (IQI), borrowing terminology from earlier work - referring to a qualitative interaction (QI) being present when the optimal treatment varies across a\"groups\" of individuals. At least three techniques have been proposed to investigate treatment heterogeneity: techniques to detect a QI, use of measures such as the density overlap of two outcome variables under different treatments, and use of cross-over designs to observe \"individual effects.\" We elucidate underlying connections among them, their limitations and some assumptions that may be required. We do so under a potential outcomes framework that can add insights to results from usual data analyses and to study design features that improve the capability to more directly assess treatment heterogeneity.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"66 1","pages":"16-24"},"PeriodicalIF":1.8,"publicationDate":"2012-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/00031305.2012.671724","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31092749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Josue G Martinez, Raymond J Carroll, Samuel Müller, Joshua N Sampson, Nilanjan Chatterjee
When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals are large, such cross-validation typically works well. However, in regression modeling of genomic studies involving single nucleotide polymorphisms (SNPs), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to non-oracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso.
{"title":"Empirical Performance of Cross-Validation With Oracle Methods in a Genomics Context.","authors":"Josue G Martinez, Raymond J Carroll, Samuel Müller, Joshua N Sampson, Nilanjan Chatterjee","doi":"10.1198/tas.2011.11052","DOIUrl":"10.1198/tas.2011.11052","url":null,"abstract":"<p><p>When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals large, such cross-validation typically works well. However, in regression modeling of genomic studies involving Single Nucleotide Polymorphisms (SNP), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to non-oracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"65 4","pages":"223-228"},"PeriodicalIF":1.8,"publicationDate":"2011-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3281424/pdf/nihms355303.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30470829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
José R Zubizarreta, Caroline E Reinke, Rachel R Kelz, Jeffrey H Silber, Paul R Rosenbaum
Matching for several nominal covariates with many levels has usually been thought to be difficult because these covariates combine to form an enormous number of interaction categories with few if any people in most such categories. Moreover, because nominal variables are not ordered, there is often no notion of a "close substitute" when an exact match is unavailable. In a case-control study of the risk factors for readmission within 30 days of surgery in the Medicare population, we wished to match for 47 hospitals, 15 surgical procedures grouped or nested within 5 procedure groups, and two genders, that is, 47 × 15 × 2 = 1410 categories. In addition, we wished to match as closely as possible for the continuous variable age (65-80 years). There were 1380 readmitted patients, or cases. A fractional factorial experiment may balance main effects and low-order interactions without achieving balance for high-order interactions. In an analogous fashion, we balance certain main effects and low-order interactions among the covariates; moreover, we use as many exactly matched pairs as possible. This is done by creating a match that is exact for several variables, with a close match for age, and both a "near-exact match" and a "finely balanced match" for another nominal variable, in this case a 47 × 5 = 235 category variable representing the interaction of the 47 hospitals and the five surgical procedure groups. The method is easily implemented in R.
{"title":"Matching for Several Sparse Nominal Variables in a Case-Control Study of Readmission Following Surgery.","authors":"José R Zubizarreta, Caroline E Reinke, Rachel R Kelz, Jeffrey H Silber, Paul R Rosenbaum","doi":"10.1198/tas.2011.11072","DOIUrl":"https://doi.org/10.1198/tas.2011.11072","url":null,"abstract":"<p><p>Matching for several nominal covariates with many levels has usually been thought to be difficult because these covariates combine to form an enormous number of interaction categories with few if any people in most such categories. Moreover, because nominal variables are not ordered, there is often no notion of a \"close substitute\" when an exact match is unavailable. In a case-control study of the risk factors for read-mission within 30 days of surgery in the Medicare population, we wished to match for 47 hospitals, 15 surgical procedures grouped or nested within 5 procedure groups, two genders, or 47 × 15 × 2 = 1410 categories. In addition, we wished to match as closely as possible for the continuous variable age (65-80 years). There were 1380 readmitted patients or cases. A fractional factorial experiment may balance main effects and low-order interactions without achieving balance for high-order interactions. In an analogous fashion, we balance certain main effects and low-order interactions among the covariates; moreover, we use as many exactly matched pairs as possible. This is done by creating a match that is exact for several variables, with a close match for age, and both a \"near-exact match\" and a \"finely balanced match\" for another nominal variable, in this case a 47 × 5 = 235 category variable representing the interaction of the 47 hospitals and the five surgical procedure groups. The method is easily implemented in R.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"65 4","pages":"229-238"},"PeriodicalIF":1.8,"publicationDate":"2011-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1198/tas.2011.11072","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32832138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effective component relabeling in Bayesian analyses of mixture models is critical to the routine use of mixtures in classification when analysis is based on Markov chain Monte Carlo (MCMC) methods. The classification-based relabeling approach here is computationally attractive and statistically effective, and it scales well with sample size and the number of mixture components, consistent with the goal of enabling routine analyses of increasingly large data sets. Building on the best of existing methods, practical relabeling aims to match data:component classification indicators in MCMC iterates with those of a defined reference mixture distribution. The method performs as well as or better than existing methods in small-dimensional problems and, because the approach is scalable, is practically superior in problems with larger data sets. We describe examples and computational benchmarks, and we provide supporting code with an efficient computational implementation of the algorithm that will be of use to others in practical applications of mixture models.
{"title":"Efficient Classification-Based Relabeling in Mixture Models.","authors":"Andrew J Cron, Mike West","doi":"10.1198/tast.2011.10170","DOIUrl":"https://doi.org/10.1198/tast.2011.10170","url":null,"abstract":"<p><p>Effective component relabeling in Bayesian analyses of mixture models is critical to the routine use of mixtures in classification with analysis based on Markov chain Monte Carlo methods. The classification-based relabeling approach here is computationally attractive and statistically effective, and scales well with sample size and number of mixture components concordant with enabling routine analyses of increasingly large data sets. Building on the best of existing methods, practical relabeling aims to match data:component classification indicators in MCMC iterates with those of a defined reference mixture distribution. The method performs as well as or better than existing methods in small dimensional problems, while being practically superior in problems with larger data sets as the approach is scalable. We describe examples and computational benchmarks, and provide supporting code with efficient computational implementation of the algorithm that will be of use to others in practical applications of mixture models.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"65 1","pages":"16-20"},"PeriodicalIF":1.8,"publicationDate":"2011-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1198/tast.2011.10170","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"29927121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2011-01-01. Epub Date: 2012-01-01. DOI: 10.1198/tast.2011.08294
Bo Lu, Robert Greevy, Xinyi Xu, Cole Beck
Matching is a powerful statistical tool in design and analysis. Conventional two-group, or bipartite, matching has been widely used in practice. However, its utility is limited to simpler designs. In contrast, nonbipartite matching is not limited to the two-group case and can handle multiparty matching situations. It can be used to find the set of matches that minimizes the sum of distances based on a given distance matrix. It brings greater flexibility to the matching design, such as multigroup comparisons. Thanks to improvements in computing power and freely available algorithms to solve nonbipartite problems, the cost in terms of computation time and complexity is low. This article reviews the optimal nonbipartite matching algorithm and its statistical applications, including observational studies with complex designs and an exact distribution-free test comparing two multivariate distributions. We also introduce an R package that performs optimal nonbipartite matching, and we present an easily accessible web application to make nonbipartite matching freely available to general researchers.
{"title":"Optimal Nonbipartite Matching and Its Statistical Applications.","authors":"Bo Lu, Robert Greevy, Xinyi Xu, Cole Beck","doi":"10.1198/tast.2011.08294","DOIUrl":"10.1198/tast.2011.08294","url":null,"abstract":"<p><p>Matching is a powerful statistical tool in design and analysis. Conventional two-group, or bipartite, matching has been widely used in practice. However, its utility is limited to simpler designs. In contrast, nonbipartite matching is not limited to the two-group case, handling multiparty matching situations. It can be used to find the set of matches that minimize the sum of distances based on a given distance matrix. It brings greater flexibility to the matching design, such as multigroup comparisons. Thanks to improvements in computing power and freely available algorithms to solve nonbipartite problems, the cost in terms of computation time and complexity is low. This article reviews the optimal nonbipartite matching algorithm and its statistical applications, including observational studies with complex designs and an exact distribution-free test comparing two multivariate distributions. We also introduce an R package that performs optimal nonbipartite matching. We present an easily accessible web application to make nonbipartite matching freely available to general researchers.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"65 1","pages":"21-30"},"PeriodicalIF":1.8,"publicationDate":"2011-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3501247/pdf/nihms412698.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31070271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}