Pub Date: 2013-01-01. DOI: 10.1080/00031305.2013.842498
Yeyi Zhu, Ladia M Hernandez, Peter Mueller, Yongquan Dong, Michele R Forman
The aim of this paper is to address issues in research that may be missing from statistics classes yet are important for (bio-)statistics students. In the context of a case study, we discuss data acquisition and preprocessing steps that fill the gap between research questions posed by subject-matter scientists and statistical methodology for formal inference. Issues include participant recruitment, data collection training and standardization, variable coding, data review and verification, data cleaning and editing, and documentation. Despite the critical importance of these details in research, most of these issues are rarely discussed in an applied statistics program. One reason for the lack of more formal training is the difficulty of systematically addressing the many challenges that can arise in the course of a study. This article helps to bridge the gap between research questions and formal statistical inference through discussion of an illustrative case study. We hope that reading and discussing this paper and practicing data preprocessing exercises will sensitize statistics students to these important issues and help them achieve optimal conduct, quality control, analysis, and interpretation of a study.
{"title":"Data Acquisition and Preprocessing in Studies on Humans: What Is Not Taught in Statistics Classes?","authors":"Yeyi Zhu, Ladia M Hernandez, Peter Mueller, Yongquan Dong, Michele R Forman","doi":"10.1080/00031305.2013.842498","DOIUrl":"10.1080/00031305.2013.842498","url":null,"abstract":"<p><p>The aim of this paper is to address issues in research that may be missing from statistics classes and important for (bio-)statistics students. In the context of a case study, we discuss data acquisition and preprocessing steps that fill the gap between research questions posed by subject matter scientists and statistical methodology for formal inference. Issues include participant recruitment, data collection training and standardization, variable coding, data review and verification, data cleaning and editing, and documentation. Despite the critical importance of these details in research, most of these issues are rarely discussed in an applied statistics program. One reason for the lack of more formal training is the difficulty in addressing the many challenges that can possibly arise in the course of a study in a systematic way. This article can help to bridge this gap between research questions and formal statistical inference by using an illustrative case study for a discussion. We hope that reading and discussing this paper and practicing data preprocessing exercises will sensitize statistics students to these important issues and achieve optimal conduct, quality control, analysis, and interpretation of a study.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"67 4","pages":"235-241"},"PeriodicalIF":1.8,"publicationDate":"2013-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3912269/pdf/nihms537499.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32104198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
At present, there are many software procedures available that enable statisticians to fit linear mixed models (LMMs) to continuous dependent variables in clustered or longitudinal datasets. LMMs are flexible tools for analyzing relationships among variables in these types of datasets, in that a variety of covariance structures can be used depending on the subject matter under study. The explicit random effects in LMMs allow analysts to make inferences about the variability between clusters or subjects in larger hypothetical populations, and examine cluster- or subject-level variables that explain portions of this variability. These models can also be used to analyze longitudinal or clustered datasets with data that are missing at random (MAR), and can accommodate time-varying covariates in longitudinal datasets. Although the software procedures currently available have many features in common, more specific analytic aspects of fitting LMMs (e.g., crossed random effects, appropriate hypothesis testing for variance components, diagnostics, incorporating sampling weights) may only be available in selected software procedures. With this article, we aim to perform a comprehensive and up-to-date comparison of the current capabilities of software procedures for fitting LMMs, and provide statisticians with a guide for selecting a software procedure appropriate for their analytic goals.
{"title":"An Overview of Current Software Procedures for Fitting Linear Mixed Models.","authors":"Brady T West, Andrzej T Galecki","doi":"10.1198/tas.2011.11077","DOIUrl":"https://doi.org/10.1198/tas.2011.11077","url":null,"abstract":"At present, there are many software procedures available that enable statisticians to fit linear mixed models (LMMs) to continuous dependent variables in clustered or longitudinal datasets. LMMs are flexible tools for analyzing relationships among variables in these types of datasets, in that a variety of covariance structures can be used depending on the subject matter under study. The explicit random effects in LMMs allow analysts to make inferences about the variability between clusters or subjects in larger hypothetical populations, and examine cluster- or subject-level variables that explain portions of this variability. These models can also be used to analyze longitudinal or clustered datasets with data that are missing at random (MAR), and can accommodate time-varying covariates in longitudinal datasets. Although the software procedures currently available have many features in common, more specific analytic aspects of fitting LMMs (e.g., crossed random effects, appropriate hypothesis testing for variance components, diagnostics, incorporating sampling weights) may only be available in selected software procedures. With this article, we aim to perform a comprehensive and up-to-date comparison of the current capabilities of software procedures for fitting LMMs, and provide statisticians with a guide for selecting a software procedure appropriate for their analytic goals.","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"65 4","pages":"274-282"},"PeriodicalIF":1.8,"publicationDate":"2012-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1198/tas.2011.11077","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31375746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-01-01. Epub Date: 2012-03-21. DOI: 10.1080/00031305.2012.676329
Bailey K Fosdick, Adrian E Raftery
We consider the problem of estimating the correlation in bivariate normal data when the means and variances are assumed known, with emphasis on the small sample case. We consider eight different estimators, several of them considered here for the first time in the literature. In a simulation study, we found that Bayesian estimators using the uniform and arc-sine priors outperformed several empirical and exact or approximate maximum likelihood estimators in small samples. The arc-sine prior did better for large values of the correlation. For testing whether the correlation is zero, we found that Bayesian hypothesis tests outperformed significance tests based on the empirical and exact or approximate maximum likelihood estimators considered in small samples, but that all tests performed similarly for sample size 50. These results lead us to suggest using the posterior mean with the arc-sine prior to estimate the correlation in small samples when the variances are assumed known.
{"title":"Estimating the Correlation in Bivariate Normal Data with Known Variances and Small Sample Sizes().","authors":"Bailey K Fosdick, Adrian E Raftery","doi":"10.1080/00031305.2012.676329","DOIUrl":"https://doi.org/10.1080/00031305.2012.676329","url":null,"abstract":"<p><p>We consider the problem of estimating the correlation in bivariate normal data when the means and variances are assumed known, with emphasis on the small sample case. We consider eight different estimators, several of them considered here for the first time in the literature. In a simulation study, we found that Bayesian estimators using the uniform and arc-sine priors outperformed several empirical and exact or approximate maximum likelihood estimators in small samples. The arc-sine prior did better for large values of the correlation. For testing whether the correlation is zero, we found that Bayesian hypothesis tests outperformed significance tests based on the empirical and exact or approximate maximum likelihood estimators considered in small samples, but that all tests performed similarly for sample size 50. These results lead us to suggest using the posterior mean with the arc-sine prior to estimate the correlation in small samples when the variances are assumed known.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"66 1","pages":"34-41"},"PeriodicalIF":1.8,"publicationDate":"2012-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/00031305.2012.676329","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31302836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-01-01. DOI: 10.1080/00031305.2012.720900
Theodore G Karrison, Mark J Ratain, Walter M Stadler, Gary L Rosner
The randomized discontinuation trial (RDT) design is an enrichment-type design that has been used in a variety of diseases to evaluate the efficacy of new treatments. The RDT design seeks to select a more homogeneous group of patients, consisting of those who are more likely to show a treatment benefit if one exists. In oncology, the RDT design has been applied to evaluate the effects of cytostatic agents, that is, drugs that act primarily by slowing tumor growth rather than shrinking tumors. In the RDT design, all patients receive treatment during an initial, open-label run-in period of duration T. Patients with objective response (substantial tumor shrinkage) remain on therapy while those with early progressive disease are removed from the trial. Patients with stable disease (SD) are then randomized either to continue active treatment or to switch to placebo. The main analysis compares outcomes, for example, progression-free survival (PFS), between the two randomized arms. As a secondary objective, investigators may seek to estimate PFS for all treated patients, measured from the time of entry into the study, by combining information from the run-in and post-run-in periods. For t ≤ T, PFS is estimated by the observed proportion of patients who are progression-free among all patients enrolled. For t > T, the estimate can be expressed as Ŝ(t) = p̂OR × ŜOR(t - T) + p̂SD × ŜSD(t - T), where p̂OR is the estimated probability of response during the run-in period, p̂SD is the estimated probability of SD, and ŜOR(t - T) and ŜSD(t - T) are the Kaplan-Meier estimates of subsequent PFS in the responders and in the patients with SD randomized to continue treatment, respectively. In this article, we derive the variance of Ŝ(t), enabling the construction of confidence intervals for both S(t) and the median survival time. Simulation results indicate that the method provides accurate coverage rates. An interesting aspect of the design is that outcomes during the run-in phase have a negative multinomial distribution, something not frequently encountered in practice.
{"title":"Estimation of Progression-Free Survival for All Treated Patients in the Randomized Discontinuation Trial Design.","authors":"Theodore G Karrison, Mark J Ratain, Walter M Stadler, Gary L Rosner","doi":"10.1080/00031305.2012.720900","DOIUrl":"https://doi.org/10.1080/00031305.2012.720900","url":null,"abstract":"<p><p>The randomized discontinuation trial (RDT) design is an enrichment-type design that has been used in a variety of diseases to evaluate the efficacy of new treatments. The RDT design seeks to select a more homogeneous group of patients, consisting of those who are more likely to show a treatment benefit if one exists. In oncology, the RDT design has been applied to evaluate the effects of cytostatic agents, that is, drugs that act primarily by slowing tumor growth rather than shrinking tumors. In the RDT design, all patients receive treatment during an initial, open-label run-in period of duration <i>T</i>. Patients with objective response (substantial tumor shrinkage) remain on therapy while those with early progressive disease are removed from the trial. Patients with stable disease (SD) are then randomized to either continue active treatment or switched to placebo. The main analysis compares outcomes, for example, progression-free survival (PFS), between the two randomized arms. As a secondary objective, investigators may seek to estimate PFS for all treated patients, measured from the time of entry into the study, by combining information from the run-in and post run-in periods. For <i>t ≤ T</i>, PFS is estimated by the observed proportion of patients who are progression-free among all patients enrolled. For <i>t > T</i>, the estimate can be expressed as <i>Ŝ</i>(<i>t</i>) = <i>p̂</i><sub>OR</sub> × <i>Ŝ</i><sub>OR</sub>(<i>t - T</i>) + <i>p̂</i><sub>SD</sub> × <i>Ŝ</i><sub>SD</sub>(<i>t - T</i>), where <i>p̂</i><sub>OR</sub> is the estimated probability of response during the run-in period, <i>p̂</i><sub>SD</sub> is the estimated probability of SD, and <i>Ŝ</i><sub>OR</sub>(<i>t - T</i>) and <i>Ŝ</i><sub>SD</sub>(<i>t - T</i>) are the Kaplan-Meier estimates of subsequent PFS in the responders and patients with SD randomized to continue treatment, respectively. In this article, we derive the variance of <i>Ŝ</i>(<i>t</i>), enabling the construction of confidence intervals for both <i>S</i>(<i>t</i>) and the median survival time. Simulation results indicate that the method provides accurate coverage rates. An interesting aspect of the design is that outcomes during the run-in phase have a negative multinomial distribution, something not frequently encountered in practice.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"66 3","pages":"155-162"},"PeriodicalIF":1.8,"publicationDate":"2012-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/00031305.2012.720900","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31736474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-01-01. DOI: 10.1080/00031305.2012.703873
Mehmet Kocak, Arzu Onar-Thomas
Cox proportional hazards (PH) models are commonly used in medical research to investigate the associations between covariates and time-to-event outcomes. It is frequently noted that with fewer than ten events per covariate, these models produce spurious results and therefore should not be used. The statistical literature contains asymptotic power formulae for the Cox model that can be used to determine the number of events needed to detect an association. Here we investigate via simulations the performance of these formulae in small-sample settings for Cox models with one or two covariates. Our simulations indicate that, when the number of events is small, the power estimate based on the asymptotic formulae is often inflated. The discrepancy between the asymptotic and empirical power is larger for the dichotomous covariate, especially when the allocation of sample size to its levels is unequal. When more than one covariate is included in the same model, the discrepancy between the asymptotic power and the empirical power is even larger, especially when a high positive correlation exists between the two covariates.
{"title":"A Simulation Based Evaluation of the Asymptotic Power Formulae for Cox Models in Small Sample Cases.","authors":"Mehmet Kocak, Arzu Onar-Thomas","doi":"10.1080/00031305.2012.703873","DOIUrl":"https://doi.org/10.1080/00031305.2012.703873","url":null,"abstract":"<p><p>Cox proportional hazards (PH) models are commonly used in medical research to investigate the associations between covariates and time to event outcomes. It is frequently noted that with less than ten events per covariate, these models produce spurious results, and therefore, should not be used. Statistical literature contains asymptotic power formulae for the Cox model which can be used to determine the number of events needed to detect an association. Here we investigate via simulations the performance of these formulae in small sample settings for Cox models with 1- or 2-covariates. Our simulations indicate that, when the number of events is small, the power estimate based on the asymptotic formulae is often inflated. The discrepancy between the asymptotic and empirical power is larger for the dichotomous covariate especially in cases where allocation of sample size to its levels is unequal. When more than one covariate is included in the same model, the discrepancy between the asymptotic power and the empirical power is even larger, especially when a high positive correlation exists between the two covariates.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"66 3","pages":"173-179"},"PeriodicalIF":1.8,"publicationDate":"2012-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/00031305.2012.703873","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31798842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-01-01. Epub Date: 2012-06-12. DOI: 10.1080/00031305.2012.671724
Robert S Poulson, Gary L Gadbury, David B Allison
The plausibility of high variability in treatment effects across individuals has been recognized as an important consideration in clinical studies. Surprisingly, little attention has been given to evaluating this variability in the design of clinical trials or in analyses of the resulting data. High variation in a treatment's efficacy or safety across individuals (referred to herein as treatment heterogeneity) may have important consequences because the optimal treatment choice for an individual may be different from that suggested by a study of average effects. We call this an individual qualitative interaction (IQI), borrowing terminology from earlier work referring to a qualitative interaction (QI) as being present when the optimal treatment varies across "groups" of individuals. At least three techniques have been proposed to investigate treatment heterogeneity: techniques to detect a QI, use of measures such as the density overlap of two outcome variables under different treatments, and use of cross-over designs to observe "individual effects." We elucidate underlying connections among them, their limitations, and some assumptions that may be required. We do so under a potential outcomes framework that can add insights to results from usual data analyses and to study design features that improve the capability to more directly assess treatment heterogeneity.
{"title":"Treatment Heterogeneity and Individual Qualitative Interaction.","authors":"Robert S Poulson, Gary L Gadbury, David B Allison","doi":"10.1080/00031305.2012.671724","DOIUrl":"https://doi.org/10.1080/00031305.2012.671724","url":null,"abstract":"<p><p>Plausibility of high variability in treatment effects across individuals has been recognized as an important consideration in clinical studies. Surprisingly, little attention has been given to evaluating this variability in design of clinical trials or analyses of resulting data. High variation in a treatment's efficacy or safety across individuals (referred to herein as treatment heterogeneity) may have important consequences because the optimal treatment choice for an individual may be different from that suggested by a study of average effects. We call this an individual qualitative interaction (IQI), borrowing terminology from earlier work - referring to a qualitative interaction (QI) being present when the optimal treatment varies across a\"groups\" of individuals. At least three techniques have been proposed to investigate treatment heterogeneity: techniques to detect a QI, use of measures such as the density overlap of two outcome variables under different treatments, and use of cross-over designs to observe \"individual effects.\" We elucidate underlying connections among them, their limitations and some assumptions that may be required. We do so under a potential outcomes framework that can add insights to results from usual data analyses and to study design features that improve the capability to more directly assess treatment heterogeneity.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"66 1","pages":"16-24"},"PeriodicalIF":1.8,"publicationDate":"2012-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/00031305.2012.671724","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31092749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Josue G Martinez, Raymond J Carroll, Samuel Müller, Joshua N Sampson, Nilanjan Chatterjee
When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals are large, such cross-validation typically works well. However, in regression modeling of genomic studies involving single nucleotide polymorphisms (SNPs), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to non-oracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso.
{"title":"Empirical Performance of Cross-Validation With Oracle Methods in a Genomics Context.","authors":"Josue G Martinez, Raymond J Carroll, Samuel Müller, Joshua N Sampson, Nilanjan Chatterjee","doi":"10.1198/tas.2011.11052","DOIUrl":"10.1198/tas.2011.11052","url":null,"abstract":"<p><p>When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals large, such cross-validation typically works well. However, in regression modeling of genomic studies involving Single Nucleotide Polymorphisms (SNP), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to non-oracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"65 4","pages":"223-228"},"PeriodicalIF":1.8,"publicationDate":"2011-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3281424/pdf/nihms355303.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30470829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
José R Zubizarreta, Caroline E Reinke, Rachel R Kelz, Jeffrey H Silber, Paul R Rosenbaum
Matching for several nominal covariates with many levels has usually been thought to be difficult because these covariates combine to form an enormous number of interaction categories with few if any people in most such categories. Moreover, because nominal variables are not ordered, there is often no notion of a "close substitute" when an exact match is unavailable. In a case-control study of the risk factors for readmission within 30 days of surgery in the Medicare population, we wished to match for 47 hospitals, 15 surgical procedures grouped or nested within 5 procedure groups, and two genders, that is, 47 × 15 × 2 = 1410 categories. In addition, we wished to match as closely as possible for the continuous variable age (65-80 years). There were 1380 readmitted patients, or cases. A fractional factorial experiment may balance main effects and low-order interactions without achieving balance for high-order interactions. In an analogous fashion, we balance certain main effects and low-order interactions among the covariates; moreover, we use as many exactly matched pairs as possible. This is done by creating a match that is exact for several variables, with a close match for age, and both a "near-exact match" and a "finely balanced match" for another nominal variable, in this case a 47 × 5 = 235 category variable representing the interaction of the 47 hospitals and the five surgical procedure groups. The method is easily implemented in R.
{"title":"Matching for Several Sparse Nominal Variables in a Case-Control Study of Readmission Following Surgery.","authors":"José R Zubizarreta, Caroline E Reinke, Rachel R Kelz, Jeffrey H Silber, Paul R Rosenbaum","doi":"10.1198/tas.2011.11072","DOIUrl":"https://doi.org/10.1198/tas.2011.11072","url":null,"abstract":"<p><p>Matching for several nominal covariates with many levels has usually been thought to be difficult because these covariates combine to form an enormous number of interaction categories with few if any people in most such categories. Moreover, because nominal variables are not ordered, there is often no notion of a \"close substitute\" when an exact match is unavailable. In a case-control study of the risk factors for read-mission within 30 days of surgery in the Medicare population, we wished to match for 47 hospitals, 15 surgical procedures grouped or nested within 5 procedure groups, two genders, or 47 × 15 × 2 = 1410 categories. In addition, we wished to match as closely as possible for the continuous variable age (65-80 years). There were 1380 readmitted patients or cases. A fractional factorial experiment may balance main effects and low-order interactions without achieving balance for high-order interactions. In an analogous fashion, we balance certain main effects and low-order interactions among the covariates; moreover, we use as many exactly matched pairs as possible. This is done by creating a match that is exact for several variables, with a close match for age, and both a \"near-exact match\" and a \"finely balanced match\" for another nominal variable, in this case a 47 × 5 = 235 category variable representing the interaction of the 47 hospitals and the five surgical procedure groups. The method is easily implemented in R.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"65 4","pages":"229-238"},"PeriodicalIF":1.8,"publicationDate":"2011-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1198/tas.2011.11072","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32832138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effective component relabeling in Bayesian analyses of mixture models is critical to the routine use of mixtures in classification when analysis is based on Markov chain Monte Carlo (MCMC) methods. The classification-based relabeling approach here is computationally attractive and statistically effective, and it scales well with sample size and the number of mixture components, consistent with the goal of enabling routine analyses of increasingly large data sets. Building on the best of existing methods, practical relabeling aims to match data:component classification indicators in MCMC iterates with those of a defined reference mixture distribution. The method performs as well as or better than existing methods in small-dimensional problems and, because the approach is scalable, is practically superior in problems with larger data sets. We describe examples and computational benchmarks, and we provide supporting code with an efficient computational implementation of the algorithm that will be of use to others in practical applications of mixture models.
{"title":"Efficient Classification-Based Relabeling in Mixture Models.","authors":"Andrew J Cron, Mike West","doi":"10.1198/tast.2011.10170","DOIUrl":"https://doi.org/10.1198/tast.2011.10170","url":null,"abstract":"<p><p>Effective component relabeling in Bayesian analyses of mixture models is critical to the routine use of mixtures in classification with analysis based on Markov chain Monte Carlo methods. The classification-based relabeling approach here is computationally attractive and statistically effective, and scales well with sample size and number of mixture components concordant with enabling routine analyses of increasingly large data sets. Building on the best of existing methods, practical relabeling aims to match data:component classification indicators in MCMC iterates with those of a defined reference mixture distribution. The method performs as well as or better than existing methods in small dimensional problems, while being practically superior in problems with larger data sets as the approach is scalable. We describe examples and computational benchmarks, and provide supporting code with efficient computational implementation of the algorithm that will be of use to others in practical applications of mixture models.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"65 1","pages":"16-20"},"PeriodicalIF":1.8,"publicationDate":"2011-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1198/tast.2011.10170","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"29927121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2011-01-01. Epub Date: 2012-01-01. DOI: 10.1198/tast.2011.08294
Bo Lu, Robert Greevy, Xinyi Xu, Cole Beck
Matching is a powerful statistical tool in design and analysis. Conventional two-group, or bipartite, matching has been widely used in practice. However, its utility is limited to simpler designs. In contrast, nonbipartite matching is not limited to the two-group case and can handle multiparty matching situations. It can be used to find the set of matches that minimizes the sum of distances based on a given distance matrix. It brings greater flexibility to the matching design, such as multigroup comparisons. Thanks to improvements in computing power and freely available algorithms to solve nonbipartite problems, the cost in terms of computation time and complexity is low. This article reviews the optimal nonbipartite matching algorithm and its statistical applications, including observational studies with complex designs and an exact distribution-free test comparing two multivariate distributions. We also introduce an R package that performs optimal nonbipartite matching, and we present an easily accessible web application to make nonbipartite matching freely available to general researchers.
{"title":"Optimal Nonbipartite Matching and Its Statistical Applications.","authors":"Bo Lu, Robert Greevy, Xinyi Xu, Cole Beck","doi":"10.1198/tast.2011.08294","DOIUrl":"10.1198/tast.2011.08294","url":null,"abstract":"<p><p>Matching is a powerful statistical tool in design and analysis. Conventional two-group, or bipartite, matching has been widely used in practice. However, its utility is limited to simpler designs. In contrast, nonbipartite matching is not limited to the two-group case, handling multiparty matching situations. It can be used to find the set of matches that minimize the sum of distances based on a given distance matrix. It brings greater flexibility to the matching design, such as multigroup comparisons. Thanks to improvements in computing power and freely available algorithms to solve nonbipartite problems, the cost in terms of computation time and complexity is low. This article reviews the optimal nonbipartite matching algorithm and its statistical applications, including observational studies with complex designs and an exact distribution-free test comparing two multivariate distributions. We also introduce an R package that performs optimal nonbipartite matching. We present an easily accessible web application to make nonbipartite matching freely available to general researchers.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"65 1","pages":"21-30"},"PeriodicalIF":1.8,"publicationDate":"2011-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3501247/pdf/nihms412698.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31070271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}