A PAUC-based estimation technique for disease classification and biomarker selection
Matthias Schmid, Torsten Hothorn, Friedemann Krause, Christina Rabe
The partial area under the receiver operating characteristic curve (PAUC) is a well-established performance measure for evaluating biomarker combinations for disease classification. Because the PAUC is defined as the area under the ROC curve within a restricted interval of false positive rates, it enables practitioners to quantify sensitivity rates within pre-specified specificity ranges, an issue of considerable importance for the development of medical screening tests. Although many authors have highlighted the importance of the PAUC, only a few methods use the PAUC as an objective function for finding optimal combinations of biomarkers. In this paper, we introduce a boosting method for deriving marker combinations that is explicitly based on the PAUC criterion. The proposed method can be applied in high-dimensional settings where the number of biomarkers exceeds the number of observations. Additionally, it incorporates a recently proposed variable selection technique (stability selection) that results in sparse prediction rules containing only those biomarkers that make relevant contributions to predicting the outcome of interest. Using both simulated and real data, we demonstrate that our method performs well with respect to both variable selection and prediction accuracy. Specifically, if the focus is on a limited range of specificity values, the new method yields better predictions than other established techniques for disease classification.
{"title":"A PAUC-based estimation technique for disease classification and biomarker selection.","authors":"Matthias Schmid, Torsten Hothorn, Friedemann Krause, Christina Rabe","doi":"10.1515/1544-6115.1792","DOIUrl":"https://doi.org/10.1515/1544-6115.1792","url":null,"abstract":"<p><p>The partial area under the receiver operating characteristic curve (PAUC) is a well-established performance measure to evaluate biomarker combinations for disease classification. Because the PAUC is defined as the area under the ROC curve within a restricted interval of false positive rates, it enables practitioners to quantify sensitivity rates within pre-specified specificity ranges. This issue is of considerable importance for the development of medical screening tests. Although many authors have highlighted the importance of PAUC, there exist only few methods that use the PAUC as an objective function for finding optimal combinations of biomarkers. In this paper, we introduce a boosting method for deriving marker combinations that is explicitly based on the PAUC criterion. The proposed method can be applied in high-dimensional settings where the number of biomarkers exceeds the number of observations. Additionally, the proposed method incorporates a recently proposed variable selection technique (stability selection) that results in sparse prediction rules incorporating only those biomarkers that make relevant contributions to predicting the outcome of interest. Using both simulated data and real data, we demonstrate that our method performs well with respect to both variable selection and prediction accuracy. Specifically, if the focus is on a limited range of specificity values, the new method results in better predictions than other established techniques for disease classification.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1792","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30951968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DNA pooling and statistical tests for the detection of single nucleotide polymorphisms
David M Ramsey, Andreas Futschik
The development of next-generation genome sequencers offers the opportunity to learn more about the genetic make-up of human and other populations. One important question involves the location of sites at which variation occurs within a population. Our focus is on the detection of rare variants. Such variants will often not be present in smaller samples and are hard to distinguish from sequencing errors in larger samples. This is particularly true for pooled samples, which are often used as part of a cost-saving strategy. The focus of this article is on experiments that involve DNA pooling. We derive experimental designs that optimize the power of statistical tests for detecting single nucleotide polymorphisms (SNPs, sites at which there is variation within a population). We also present a new simple test that calls a SNP if the maximum number of reads of a prospective variant across lanes exceeds a certain threshold. The value of this threshold is determined by the number of available lanes, the parameters of the genome sequencer, and a specified probability of accepting that there is variation at a site when none is present. On the basis of this test, we derive pool sizes that are optimal for the detection of rare variants. The test is compared with a likelihood ratio test that takes into account the number of reads of a prospective variant from all the lanes. We show that the threshold-based rule achieves power comparable to this likelihood ratio test and may well be a useful tool for determining near-optimal pool sizes for the detection of rare alleles in practical applications.
{"title":"DNA pooling and statistical tests for the detection of single nucleotide polymorphisms.","authors":"David M Ramsey, Andreas Futschik","doi":"10.1515/1544-6115.1763","DOIUrl":"https://doi.org/10.1515/1544-6115.1763","url":null,"abstract":"<p><p>The development of next generation genome sequencers gives the opportunity of learning more about the genetic make-up of human and other populations. One important question involves the location of sites at which variation occurs within a population. Our focus will be on the detection of rare variants. Such variants will often not be present in smaller samples and are hard to distinguish from sequencing errors in larger samples. This is particularly true for pooled samples which are often used as part of a cost saving strategy. The focus of this article is on experiments that involve DNA pooling. We derive experimental designs that optimize the power of statistical tests for detecting single nucleotide polymorphisms (SNPs, sites at which there is variation within a population). We also present a new simple test that calls a SNP, if the maximum number of reads of a prospective variant across lanes exceeds a certain threshold. The value of this threshold is defined according to the number of available lanes, the parameters of the genome sequencer and a specified probability of accepting that there is variation at a site when no variation is present. On the basis of this test, we derive pool sizes which are optimal for the detection of rare variants. This test is compared with a likelihood ratio test, which takes into account the number of reads of a prospective variant from all the lanes. It is shown that the threshold based rule achieves a comparable power to this likelihood ratio test and may well be a useful tool in determining near optimal pool sizes for the detection of rare alleles in practical applications.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":"Article 1"},"PeriodicalIF":0.9,"publicationDate":"2012-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1763","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30943216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparison of targeted maximum likelihood and shrinkage estimators of parameters in gene networks
Geert Geeven, Mark J van der Laan, Mathisca C M de Gunst
Gene regulatory networks, in which edges between nodes describe interactions between transcription factors (TFs) and their target genes, model regulatory interactions that determine the cell-type- and condition-specific expression of genes. Regression methods can be used to identify TF-target gene interactions from gene expression and DNA sequence data. The response variable, i.e. observed gene expression, is modeled as a function of many predictor variables simultaneously. In practice, it is generally not possible to select a single model that clearly achieves the best fit to the observed experimental data, and the selected models typically contain overlapping sets of predictor variables. Moreover, parameters that represent the marginal effect of the individual predictors are not always present. In this paper, we use the statistical framework of estimation of variable importance to define variable importance as a parameter of interest, and we study two different estimators of this parameter in the context of gene regulatory networks. On yeast data we show that the resulting parameter has a biologically appealing interpretation. We apply the proposed methodology to mammalian gene expression data to gain insight into the temporal activity of TFs that underlie gene expression changes in F11 cells in response to Forskolin stimulation.
{"title":"Comparison of targeted maximum likelihood and shrinkage estimators of parameters in gene networks.","authors":"Geert Geeven, Mark J van der Laan, Mathisca C M de Gunst","doi":"10.1515/1544-6115.1728","DOIUrl":"https://doi.org/10.1515/1544-6115.1728","url":null,"abstract":"<p><p>Gene regulatory networks, in which edges between nodes describe interactions between transcription factors (TFs) and their target genes, model regulatory interactions that determine the cell-type and condition-specific expression of genes. Regression methods can be used to identify TF-target gene interactions from gene expression and DNA sequence data. The response variable, i.e. observed gene expression, is modeled as a function of many predictor variables simultaneously. In practice, it is generally not possible to select a single model that clearly achieves the best fit to the observed experimental data and the selected models typically contain overlapping sets of predictor variables. Moreover, parameters that represent the marginal effect of the individual predictors are not always present. In this paper, we use the statistical framework of estimation of variable importance to define variable importance as a parameter of interest and study two different estimators of this parameter in the context of gene regulatory networks. On yeast data we show that the resulting parameter has a biologically appealing interpretation. We apply the proposed methodology on mammalian gene expression data to gain insight into the temporal activity of TFs that underly gene expression changes in F11 cells in response to Forskolin stimulation.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":"Article 2"},"PeriodicalIF":0.9,"publicationDate":"2012-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1728","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30943215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hessian calculation for phylogenetic likelihood based on the pruning algorithm and its applications
Toby Kenney, Hong Gu
We analytically derive the first and second derivatives of the likelihood in maximum likelihood methods for phylogeny. These results enable the Newton-Raphson method to be used for maximising the likelihood, which is important because faster methods for optimising parameters in maximum likelihood methods are needed. Furthermore, the calculation of the Hessian matrix opens up possibilities for standard likelihood theory to be applied, for inference in phylogeny and for model selection problems. Another application of the Hessian matrix is local influence analysis, which can be used for detecting a number of biologically interesting phenomena. The pruning algorithm has been used to speed up computation of likelihoods for a tree. We explain how it can also be used to speed up the computation of the first and second derivatives of the likelihood with respect to branch lengths and other parameters. The results in this paper apply not only to bifurcating trees but also to general multifurcating trees. We demonstrate the use of our Hessian calculation for the three applications listed above and compare with existing methods for those applications.
{"title":"Hessian calculation for phylogenetic likelihood based on the pruning algorithm and its applications.","authors":"Toby Kenney, Hong Gu","doi":"10.1515/1544-6115.1779","DOIUrl":"https://doi.org/10.1515/1544-6115.1779","url":null,"abstract":"<p><p>We analytically derive the first and second derivatives of the likelihood in maximum likelihood methods for phylogeny. These results enable the Newton-Raphson method to be used for maximising likelihood, which is important because there is a need for faster methods for optimisation of parameters in maximum likelihood methods. Furthermore, the calculation of the Hessian matrix also opens up possibilities for standard likelihood theory to be applied, for inference in phylogeny and for model selection problems. Another application of the Hessian matrix is local influence analysis, which can be used for detecting a number of biologically interesting phenomena. The pruning algorithm has been used to speed up computation of likelihoods for a tree. We explain how it can be used to speed up the computation for the first and second derivatives of the likelihood with respect to branch lengths and other parameters. The results in this paper apply not only to bifurcating trees, but also to general multifurcating trees. We demonstrate the use of our Hessian calculation for the three applications listed above, and compare with existing methods for those applications.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 4","pages":"Article 14"},"PeriodicalIF":0.9,"publicationDate":"2012-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1779","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30943876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A new explained-variance based genetic risk score for predictive modeling of disease risk
Ronglin Che, Alison A Motsinger-Reif
The goal of association mapping is to identify genetic variants that predict disease, and as the field of human genetics matures, the number of successful association studies is increasing. Many such studies have shown that, for many diseases, risk is explained by a reasonably large number of variants, each of which explains a very small amount of disease risk. This has prompted the use of genetic risk scores in building predictive models, where information across several variants is combined for predictive modeling. In the current study, we compare the performance of four previously proposed genetic risk score methods and present a new method for constructing genetic risk scores that incorporates explained-variance information. The methods compared are: a simple count genetic risk score, an odds-ratio-weighted genetic risk score, a direct logistic regression genetic risk score, a polygenic genetic risk score, and the new explained-variance-weighted genetic risk score. We compare the methods using a wide range of simulations in two steps, varying the number of deleterious single nucleotide polymorphisms (SNPs) explaining disease risk, genetic modes, baseline penetrances, sample sizes, relative risks (RR) and minor allele frequencies (MAF). Several measures of model performance were compared, including overall power, the C-statistic and Akaike's Information Criterion. Our results show that the relative performance of the methods differs significantly, with the new explained-variance-weighted GRS (EV-GRS) generally performing favorably compared to the other methods.
{"title":"A new explained-variance based genetic risk score for predictive modeling of disease risk.","authors":"Ronglin Che, Alison A Motsinger-Reif","doi":"10.1515/1544-6115.1796","DOIUrl":"https://doi.org/10.1515/1544-6115.1796","url":null,"abstract":"<p><p>The goal of association mapping is to identify genetic variants that predict disease, and as the field of human genetics matures, the number of successful association studies is increasing. Many such studies have shown that for many diseases, risk is explained by a reasonably large number of variants that each explains a very small amount of disease risk. This is prompting the use of genetic risk scores in building predictive models, where information across several variants is combined for predictive modeling. In the current study, we compare the performance of four previously proposed genetic risk score methods and present a new method for constructing genetic risk score that incorporates explained variance information. The methods compared include: a simple count Genetic Risk Score, an odds ratio weighted Genetic Risk Score, a direct logistic regression Genetic Risk Score, a polygenic Genetic Risk Score, and the new explained variance weighted Genetic Risk Score. We compare the methods using a wide range of simulations in two steps, with a range of the number of deleterious single nucleotide polymorphisms (SNPs) explaining disease risk, genetic modes, baseline penetrances, sample sizes, relative risks (RR) and minor allele frequencies (MAF). Several measures of model performance were compared including overall power, C-statistic and Akaike's Information Criterion. Our results show the relative performance of methods differs significantly, with the new explained variance weighted GRS (EV-GRS) generally performing favorably to the other methods.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 4","pages":"Article 15"},"PeriodicalIF":0.9,"publicationDate":"2012-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1796","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30943875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cluster-localized sparse logistic regression for SNP data
Harald Binder, Tina Müller, Holger Schwender, Klaus Golka, Michael Steffens, Jan G Hengstler, Katja Ickstadt, Martin Schumacher
The task of analyzing high-dimensional single nucleotide polymorphism (SNP) data in a case-control design using multivariable techniques has only recently been tackled. While many available approaches investigate only main effects in a high-dimensional setting, we propose a more flexible technique, cluster-localized regression (CLR), based on localized logistic regression models, which allows different SNPs to have an effect for different groups of individuals. Separate multivariable regression models are fitted for the different groups of individuals by incorporating weights into componentwise boosting, which provides simultaneous variable selection and hence sparse fits. For model fitting, the groups of individuals are identified using a clustering approach, where each group may be defined via different SNPs. This allows complex interaction patterns, such as compositional epistasis, to be represented that might not be detected by a single main-effects model. In a simulation study, the CLR approach results in improved prediction performance compared to the main-effects approach and identifies the important SNPs in several scenarios. Improved prediction performance is also obtained in an application to urinary bladder cancer. Some of the identified SNPs are predictive for all individuals, while others are relevant only for a specific group. Together with the sets of SNPs that define the groups, potential interaction patterns are uncovered.
{"title":"Cluster-localized sparse logistic regression for SNP data.","authors":"Harald Binder, Tina Müller, Holger Schwender, Klaus Golka, Michael Steffens, Jan G Hengstler, Katja Ickstadt, Martin Schumacher","doi":"10.1515/1544-6115.1694","DOIUrl":"https://doi.org/10.1515/1544-6115.1694","url":null,"abstract":"<p><p>The task of analyzing high-dimensional single nucleotide polymorphism (SNP) data in a case-control design using multivariable techniques has only recently been tackled. While many available approaches investigate only main effects in a high-dimensional setting, we propose a more flexible technique, cluster-localized regression (CLR), based on localized logistic regression models, that allows different SNPs to have an effect for different groups of individuals. Separate multivariable regression models are fitted for the different groups of individuals by incorporating weights into componentwise boosting, which provides simultaneous variable selection, hence sparse fits. For model fitting, these groups of individuals are identified using a clustering approach, where each group may be defined via different SNPs. This allows for representing complex interaction patterns, such as compositional epistasis, that might not be detected by a single main effects model. In a simulation study, the CLR approach results in improved prediction performance, compared to the main effects approach, and identification of important SNPs in several scenarios. Improved prediction performance is also obtained for an application example considering urinary bladder cancer. Some of the identified SNPs are predictive for all individuals, while others are only relevant for a specific group. Together with the sets of SNPs that define the groups, potential interaction patterns are uncovered.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 4","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1694","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30879018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How to analyze many contingency tables simultaneously in genetic association studies
Thorsten Dickhaus, Klaus Straßburger, Daniel Schunk, Carlos Morcillo-Suarez, Thomas Illig, Arcadi Navarro
We study exact tests for 2 x 2 and 2 x 3 contingency tables, in particular exact chi-squared tests and exact tests of Fisher type. In practice, these tests are typically carried out without randomization, leading to reproducible results but not exhausting the significance level. We discuss how this can lead to methodological and practical issues in a multiple testing framework in which many tables are under consideration simultaneously, as in genetic association studies. Realized randomized p-values are proposed as a solution that is especially useful for data-adaptive (plug-in) procedures. These p-values allow the proportion of true null hypotheses to be estimated much more accurately than their non-randomized counterparts. Moreover, we address the problem of positively correlated p-values for association by considering techniques to reduce multiplicity by estimating the "effective number of tests" from the correlation structure. An algorithm that bundles all these aspects is provided, efficient computer implementations are made available, a small-scale simulation study is presented, and two real data examples are shown.
{"title":"How to analyze many contingency tables simultaneously in genetic association studies.","authors":"Thorsten Dickhaus, Klaus Straßburger, Daniel Schunk, Carlos Morcillo-Suarez, Thomas Illig, Arcadi Navarro","doi":"10.1515/1544-6115.1776","DOIUrl":"https://doi.org/10.1515/1544-6115.1776","url":null,"abstract":"<p><p>We study exact tests for (2 x 2) and (2 x 3) contingency tables, in particular exact chi-squared tests and exact tests of Fisher type. In practice, these tests are typically carried out without randomization, leading to reproducible results but not exhausting the significance level. We discuss that this can lead to methodological and practical issues in a multiple testing framework when many tables are simultaneously under consideration as in genetic association studies.Realized randomized p-values are proposed as a solution which is especially useful for data-adaptive (plug-in) procedures. These p-values allow to estimate the proportion of true null hypotheses much more accurately than their non-randomized counterparts. Moreover, we address the problem of positively correlated p-values for association by considering techniques to reduce multiplicity by estimating the \"effective number of tests\" from the correlation structure.An algorithm is provided that bundles all these aspects, efficient computer implementations are made available, a small-scale simulation study is presented and two real data examples are shown.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 4","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1776","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30802134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Incorporating the empirical null hypothesis into the Benjamini-Hochberg procedure
Debashis Ghosh
For the problem of multiple testing, the Benjamini-Hochberg (B-H) procedure has become a very popular method in applications. We show how the B-H procedure can be interpreted as a test based on the spacings of the p-value distribution. This interpretation leads to the incorporation of the empirical null hypothesis, a term coined by Efron (2004). We develop a mixture modelling approach to the empirical null hypothesis for the B-H procedure and demonstrate theoretical results regarding both finite-sample and asymptotic control of the false discovery rate. The methodology is illustrated with applications to two high-throughput datasets as well as to simulated data.
{"title":"Incorporating the empirical null hypothesis into the Benjamini-Hochberg procedure.","authors":"Debashis Ghosh","doi":"10.1515/1544-6115.1735","DOIUrl":"https://doi.org/10.1515/1544-6115.1735","url":null,"abstract":"<p><p>For the problem of multiple testing, the Benjamini-Hochberg (B-H) procedure has become a very popular method in applications. We show how the B-H procedure can be interpreted as a test based on the spacings corresponding to the p-value distributions. This interpretation leads to the incorporation of the empirical null hypothesis, a term coined by Efron (2004). We develop a mixture modelling approach for the empirical null hypothesis for the B-H procedure and demonstrate some theoretical results regarding both finite-sample as well as asymptotic control of the false discovery rate. The methodology is illustrated with application to two high-throughput datasets as well as to simulated data.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 4","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1735","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30802138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Estimating the number of one-step beneficial mutations
Andrzej J Wojtowicz, Craig R Miller, Paul Joyce
Mutations that confer a selective advantage to an organism are the raw material upon which natural selection acts. The number of such mutations that are available is a central quantity of interest for understanding the tempo and trajectory of adaptive evolution. While this quantity is typically unknown, it can be estimated with varying levels of accuracy from experimental data. We propose a method for estimating the number of beneficial mutations that accounts for the evolutionary forces that generate the data. Our model-based parametric approach is compared to an adjusted nonparametric abundance-based coverage estimator. We show that, in general, our estimator performs better; when the number of mutations is small, however, the performance of the two estimators is similar.
{"title":"Estimating the number of one-step beneficial mutations.","authors":"Andrzej J Wojtowicz, Craig R Miller, Paul Joyce","doi":"10.1515/1544-6115.1788","DOIUrl":"https://doi.org/10.1515/1544-6115.1788","url":null,"abstract":"<p><p>Mutations that confer a selective advantage to an organism are the raw material upon which natural selection acts. The number of such mutations that are available is a central quantity of interest for understanding the tempo and trajectory of adaptive evolution. While this quantity is typically unknown, it can be estimated with varying levels of accuracy based on data obtained experimentally. We propose a method for estimating the number of beneficial mutations that accounts for the evolutionary forces that generate the data. Our model-based parametric approach is compared to an adjusted nonparametric abundance-based coverage estimator. We show that, in general, our estimator performs better. When the number of mutations is small, however, the performances of the two estimators are similar.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 4","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1788","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30802133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Testing clonality of three and more tumors using their loss of heterozygosity profiles
Irina Ostrovnaya
Cancer patients often develop multiple malignancies that may be either metastatic spread of a previous cancer (clonal tumors) or new primary cancers (independent tumors). If the diagnosis cannot easily be made on the basis of the pathology review, the patterns of somatic mutations in the tumors can be compared. We previously developed statistical methods for testing the clonality of two tumors using their loss of heterozygosity (LOH) profiles at several candidate markers. These methods can be applied to all possible pairs of tumors when multiple tumors are analyzed, but this strategy can lead to inconsistent results and loss of statistical power. In this work we extend the clonality tests to three or more malignancies from the same patient. A non-parametric test can be performed using any possible subset of tumors, with subsequent adjustment for multiple testing. A parametric likelihood model is developed for three or four tumors, and it can be used to estimate the phylogenetic tree of the tumors. The proposed tests are more powerful than the combination of all possible pairwise tests.
{"title":"Testing clonality of three and more tumors using their loss of heterozygosity profiles.","authors":"Irina Ostrovnaya","doi":"10.1515/1544-6115.1757","DOIUrl":"https://doi.org/10.1515/1544-6115.1757","url":null,"abstract":"<p><p>Cancer patients often develop multiple malignancies that may be either metastatic spread of a previous cancer (clonal tumors) or new primary cancers (independent tumors). If diagnosis cannot be easily made on the basis of the pathology review, the patterns of somatic mutations in the tumors can be compared. Previously we have developed statistical methods for testing clonality of two tumors using their loss of heterozygosity (LOH) profiles at several candidate markers. These methods can be applied to all possible pairs of tumors when multiple tumors are analyzed, but this strategy can lead to inconsistent results and loss of statistical power. In this work we will extend clonality tests to three and more malignancies from the same patient. A non-parametric test can be performed using any possible subset of tumors, with the subsequent adjustment for multiple testing. A parametric likelihood model is developed for 3 or 4 tumors, and it can be used to estimate the phylogenetic tree of tumors. The proposed tests are more powerful than combination of all possible pairwise tests.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 4","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1757","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30801577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}