A PAUC-based estimation technique for disease classification and biomarker selection
Matthias Schmid, Torsten Hothorn, Friedemann Krause, Christina Rabe
The partial area under the receiver operating characteristic curve (PAUC) is a well-established performance measure for evaluating biomarker combinations for disease classification. Because the PAUC is defined as the area under the ROC curve within a restricted interval of false positive rates, it enables practitioners to quantify sensitivity rates within pre-specified specificity ranges, an issue of considerable importance for the development of medical screening tests. Although many authors have highlighted the importance of the PAUC, only a few methods use the PAUC as an objective function for finding optimal combinations of biomarkers. In this paper, we introduce a boosting method for deriving marker combinations that is explicitly based on the PAUC criterion. The proposed method can be applied in high-dimensional settings where the number of biomarkers exceeds the number of observations. Additionally, it incorporates a recently proposed variable selection technique (stability selection) that results in sparse prediction rules containing only those biomarkers that make relevant contributions to predicting the outcome of interest. Using both simulated and real data, we demonstrate that our method performs well with respect to both variable selection and prediction accuracy. Specifically, if the focus is on a limited range of specificity values, the new method yields better predictions than other established techniques for disease classification.
{"title":"A PAUC-based estimation technique for disease classification and biomarker selection.","authors":"Matthias Schmid, Torsten Hothorn, Friedemann Krause, Christina Rabe","doi":"10.1515/1544-6115.1792","DOIUrl":"https://doi.org/10.1515/1544-6115.1792","url":null,"abstract":"<p><p>The partial area under the receiver operating characteristic curve (PAUC) is a well-established performance measure to evaluate biomarker combinations for disease classification. Because the PAUC is defined as the area under the ROC curve within a restricted interval of false positive rates, it enables practitioners to quantify sensitivity rates within pre-specified specificity ranges. This issue is of considerable importance for the development of medical screening tests. Although many authors have highlighted the importance of PAUC, there exist only few methods that use the PAUC as an objective function for finding optimal combinations of biomarkers. In this paper, we introduce a boosting method for deriving marker combinations that is explicitly based on the PAUC criterion. The proposed method can be applied in high-dimensional settings where the number of biomarkers exceeds the number of observations. Additionally, the proposed method incorporates a recently proposed variable selection technique (stability selection) that results in sparse prediction rules incorporating only those biomarkers that make relevant contributions to predicting the outcome of interest. Using both simulated data and real data, we demonstrate that our method performs well with respect to both variable selection and prediction accuracy. Specifically, if the focus is on a limited range of specificity values, the new method results in better predictions than other established techniques for disease classification.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1792","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30951968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DNA pooling and statistical tests for the detection of single nucleotide polymorphisms
David M Ramsey, Andreas Futschik
The development of next-generation genome sequencers offers the opportunity to learn more about the genetic make-up of human and other populations. One important question involves the location of sites at which variation occurs within a population. Our focus is on the detection of rare variants. Such variants will often not be present in smaller samples and are hard to distinguish from sequencing errors in larger samples. This is particularly true for pooled samples, which are often used as part of a cost-saving strategy. The focus of this article is on experiments that involve DNA pooling. We derive experimental designs that optimize the power of statistical tests for detecting single nucleotide polymorphisms (SNPs, sites at which there is variation within a population). We also present a new simple test that calls a SNP if the maximum number of reads of a prospective variant across lanes exceeds a certain threshold. The value of this threshold is determined by the number of available lanes, the parameters of the genome sequencer, and a specified probability of accepting that there is variation at a site when none is present. On the basis of this test, we derive pool sizes that are optimal for the detection of rare variants. The test is compared with a likelihood ratio test that takes into account the number of reads of a prospective variant from all the lanes. We show that the threshold-based rule achieves power comparable to this likelihood ratio test and may well be a useful tool for determining near-optimal pool sizes for the detection of rare alleles in practical applications.
{"title":"DNA pooling and statistical tests for the detection of single nucleotide polymorphisms.","authors":"David M Ramsey, Andreas Futschik","doi":"10.1515/1544-6115.1763","DOIUrl":"https://doi.org/10.1515/1544-6115.1763","url":null,"abstract":"<p><p>The development of next generation genome sequencers gives the opportunity of learning more about the genetic make-up of human and other populations. One important question involves the location of sites at which variation occurs within a population. Our focus will be on the detection of rare variants. Such variants will often not be present in smaller samples and are hard to distinguish from sequencing errors in larger samples. This is particularly true for pooled samples which are often used as part of a cost saving strategy. The focus of this article is on experiments that involve DNA pooling. We derive experimental designs that optimize the power of statistical tests for detecting single nucleotide polymorphisms (SNPs, sites at which there is variation within a population). We also present a new simple test that calls a SNP, if the maximum number of reads of a prospective variant across lanes exceeds a certain threshold. The value of this threshold is defined according to the number of available lanes, the parameters of the genome sequencer and a specified probability of accepting that there is variation at a site when no variation is present. On the basis of this test, we derive pool sizes which are optimal for the detection of rare variants. This test is compared with a likelihood ratio test, which takes into account the number of reads of a prospective variant from all the lanes. It is shown that the threshold based rule achieves a comparable power to this likelihood ratio test and may well be a useful tool in determining near optimal pool sizes for the detection of rare alleles in practical applications.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":"Article 1"},"PeriodicalIF":0.9,"publicationDate":"2012-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1763","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30943216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparison of targeted maximum likelihood and shrinkage estimators of parameters in gene networks
Geert Geeven, Mark J van der Laan, Mathisca C M de Gunst
Gene regulatory networks, in which edges between nodes describe interactions between transcription factors (TFs) and their target genes, model regulatory interactions that determine the cell-type- and condition-specific expression of genes. Regression methods can be used to identify TF-target gene interactions from gene expression and DNA sequence data. The response variable, i.e. observed gene expression, is modeled as a function of many predictor variables simultaneously. In practice, it is generally not possible to select a single model that clearly achieves the best fit to the observed experimental data, and the selected models typically contain overlapping sets of predictor variables. Moreover, parameters that represent the marginal effect of the individual predictors are not always present. In this paper, we use the statistical framework of estimation of variable importance to define variable importance as a parameter of interest, and we study two different estimators of this parameter in the context of gene regulatory networks. On yeast data we show that the resulting parameter has a biologically appealing interpretation. We apply the proposed methodology to mammalian gene expression data to gain insight into the temporal activity of TFs that underlie gene expression changes in F11 cells in response to Forskolin stimulation.
{"title":"Comparison of targeted maximum likelihood and shrinkage estimators of parameters in gene networks.","authors":"Geert Geeven, Mark J van der Laan, Mathisca C M de Gunst","doi":"10.1515/1544-6115.1728","DOIUrl":"https://doi.org/10.1515/1544-6115.1728","url":null,"abstract":"<p><p>Gene regulatory networks, in which edges between nodes describe interactions between transcription factors (TFs) and their target genes, model regulatory interactions that determine the cell-type and condition-specific expression of genes. Regression methods can be used to identify TF-target gene interactions from gene expression and DNA sequence data. The response variable, i.e. observed gene expression, is modeled as a function of many predictor variables simultaneously. In practice, it is generally not possible to select a single model that clearly achieves the best fit to the observed experimental data and the selected models typically contain overlapping sets of predictor variables. Moreover, parameters that represent the marginal effect of the individual predictors are not always present. In this paper, we use the statistical framework of estimation of variable importance to define variable importance as a parameter of interest and study two different estimators of this parameter in the context of gene regulatory networks. On yeast data we show that the resulting parameter has a biologically appealing interpretation. We apply the proposed methodology on mammalian gene expression data to gain insight into the temporal activity of TFs that underly gene expression changes in F11 cells in response to Forskolin stimulation.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":"Article 2"},"PeriodicalIF":0.9,"publicationDate":"2012-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1728","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30943215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hessian calculation for phylogenetic likelihood based on the pruning algorithm and its applications
Toby Kenney, Hong Gu
We analytically derive the first and second derivatives of the likelihood in maximum likelihood methods for phylogeny. These results enable the Newton-Raphson method to be used for maximising the likelihood, which is important because faster methods for optimising parameters in maximum likelihood methods are needed. Furthermore, the calculation of the Hessian matrix opens up possibilities for standard likelihood theory to be applied, for inference in phylogeny and for model selection problems. Another application of the Hessian matrix is local influence analysis, which can be used for detecting a number of biologically interesting phenomena. The pruning algorithm has been used to speed up computation of likelihoods for a tree. We explain how it can also be used to speed up the computation of the first and second derivatives of the likelihood with respect to branch lengths and other parameters. The results in this paper apply not only to bifurcating trees but also to general multifurcating trees. We demonstrate the use of our Hessian calculation for the three applications listed above and compare with existing methods for those applications.
{"title":"Hessian calculation for phylogenetic likelihood based on the pruning algorithm and its applications.","authors":"Toby Kenney, Hong Gu","doi":"10.1515/1544-6115.1779","DOIUrl":"https://doi.org/10.1515/1544-6115.1779","url":null,"abstract":"<p><p>We analytically derive the first and second derivatives of the likelihood in maximum likelihood methods for phylogeny. These results enable the Newton-Raphson method to be used for maximising likelihood, which is important because there is a need for faster methods for optimisation of parameters in maximum likelihood methods. Furthermore, the calculation of the Hessian matrix also opens up possibilities for standard likelihood theory to be applied, for inference in phylogeny and for model selection problems. Another application of the Hessian matrix is local influence analysis, which can be used for detecting a number of biologically interesting phenomena. The pruning algorithm has been used to speed up computation of likelihoods for a tree. We explain how it can be used to speed up the computation for the first and second derivatives of the likelihood with respect to branch lengths and other parameters. The results in this paper apply not only to bifurcating trees, but also to general multifurcating trees. We demonstrate the use of our Hessian calculation for the three applications listed above, and compare with existing methods for those applications.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 4","pages":"Article 14"},"PeriodicalIF":0.9,"publicationDate":"2012-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1779","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30943876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A new explained-variance based genetic risk score for predictive modeling of disease risk
Ronglin Che, Alison A Motsinger-Reif
The goal of association mapping is to identify genetic variants that predict disease, and as the field of human genetics matures, the number of successful association studies is increasing. Many such studies have shown that, for many diseases, risk is explained by a reasonably large number of variants, each of which explains a very small amount of disease risk. This has prompted the use of genetic risk scores in building predictive models, where information across several variants is combined for predictive modeling. In the current study, we compare the performance of four previously proposed genetic risk score methods and present a new method for constructing genetic risk scores that incorporates explained-variance information. The methods compared are: a simple count genetic risk score, an odds-ratio-weighted genetic risk score, a direct logistic regression genetic risk score, a polygenic genetic risk score, and the new explained-variance-weighted genetic risk score. We compare the methods using a wide range of simulations in two steps, varying the number of deleterious single nucleotide polymorphisms (SNPs) explaining disease risk, genetic modes, baseline penetrances, sample sizes, relative risks (RR) and minor allele frequencies (MAF). Several measures of model performance were compared, including overall power, the C-statistic and Akaike's Information Criterion. Our results show that the relative performance of the methods differs significantly, with the new explained-variance-weighted GRS (EV-GRS) generally performing favorably compared to the other methods.
{"title":"A new explained-variance based genetic risk score for predictive modeling of disease risk.","authors":"Ronglin Che, Alison A Motsinger-Reif","doi":"10.1515/1544-6115.1796","DOIUrl":"https://doi.org/10.1515/1544-6115.1796","url":null,"abstract":"<p><p>The goal of association mapping is to identify genetic variants that predict disease, and as the field of human genetics matures, the number of successful association studies is increasing. Many such studies have shown that for many diseases, risk is explained by a reasonably large number of variants that each explains a very small amount of disease risk. This is prompting the use of genetic risk scores in building predictive models, where information across several variants is combined for predictive modeling. In the current study, we compare the performance of four previously proposed genetic risk score methods and present a new method for constructing genetic risk score that incorporates explained variance information. The methods compared include: a simple count Genetic Risk Score, an odds ratio weighted Genetic Risk Score, a direct logistic regression Genetic Risk Score, a polygenic Genetic Risk Score, and the new explained variance weighted Genetic Risk Score. We compare the methods using a wide range of simulations in two steps, with a range of the number of deleterious single nucleotide polymorphisms (SNPs) explaining disease risk, genetic modes, baseline penetrances, sample sizes, relative risks (RR) and minor allele frequencies (MAF). Several measures of model performance were compared including overall power, C-statistic and Akaike's Information Criterion. Our results show the relative performance of methods differs significantly, with the new explained variance weighted GRS (EV-GRS) generally performing favorably to the other methods.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 4","pages":"Article 15"},"PeriodicalIF":0.9,"publicationDate":"2012-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1796","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30943875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cluster-localized sparse logistic regression for SNP data
Harald Binder, Tina Müller, Holger Schwender, Klaus Golka, Michael Steffens, Jan G Hengstler, Katja Ickstadt, Martin Schumacher
The task of analyzing high-dimensional single nucleotide polymorphism (SNP) data in a case-control design using multivariable techniques has only recently been tackled. While many available approaches investigate only main effects in a high-dimensional setting, we propose a more flexible technique, cluster-localized regression (CLR), based on localized logistic regression models, which allows different SNPs to have an effect for different groups of individuals. Separate multivariable regression models are fitted for the different groups of individuals by incorporating weights into componentwise boosting, which provides simultaneous variable selection and hence sparse fits. For model fitting, the groups of individuals are identified using a clustering approach, where each group may be defined via different SNPs. This allows complex interaction patterns, such as compositional epistasis, to be represented that might not be detected by a single main-effects model. In a simulation study, the CLR approach results in improved prediction performance compared to the main-effects approach and identifies the important SNPs in several scenarios. Improved prediction performance is also obtained in an application to urinary bladder cancer. Some of the identified SNPs are predictive for all individuals, while others are relevant only for a specific group. Together with the sets of SNPs that define the groups, potential interaction patterns are uncovered.
{"title":"Cluster-localized sparse logistic regression for SNP data.","authors":"Harald Binder, Tina Müller, Holger Schwender, Klaus Golka, Michael Steffens, Jan G Hengstler, Katja Ickstadt, Martin Schumacher","doi":"10.1515/1544-6115.1694","DOIUrl":"https://doi.org/10.1515/1544-6115.1694","url":null,"abstract":"<p><p>The task of analyzing high-dimensional single nucleotide polymorphism (SNP) data in a case-control design using multivariable techniques has only recently been tackled. While many available approaches investigate only main effects in a high-dimensional setting, we propose a more flexible technique, cluster-localized regression (CLR), based on localized logistic regression models, that allows different SNPs to have an effect for different groups of individuals. Separate multivariable regression models are fitted for the different groups of individuals by incorporating weights into componentwise boosting, which provides simultaneous variable selection, hence sparse fits. For model fitting, these groups of individuals are identified using a clustering approach, where each group may be defined via different SNPs. This allows for representing complex interaction patterns, such as compositional epistasis, that might not be detected by a single main effects model. In a simulation study, the CLR approach results in improved prediction performance, compared to the main effects approach, and identification of important SNPs in several scenarios. Improved prediction performance is also obtained for an application example considering urinary bladder cancer. Some of the identified SNPs are predictive for all individuals, while others are only relevant for a specific group. Together with the sets of SNPs that define the groups, potential interaction patterns are uncovered.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 4","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1694","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30879018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How to analyze many contingency tables simultaneously in genetic association studies
Thorsten Dickhaus, Klaus Straßburger, Daniel Schunk, Carlos Morcillo-Suarez, Thomas Illig, Arcadi Navarro
We study exact tests for 2 x 2 and 2 x 3 contingency tables, in particular exact chi-squared tests and exact tests of Fisher type. In practice, these tests are typically carried out without randomization, leading to reproducible results but not exhausting the significance level. We discuss how this can lead to methodological and practical issues in a multiple testing framework in which many tables are under consideration simultaneously, as in genetic association studies. Realized randomized p-values are proposed as a solution that is especially useful for data-adaptive (plug-in) procedures. These p-values allow the proportion of true null hypotheses to be estimated much more accurately than their non-randomized counterparts. Moreover, we address the problem of positively correlated p-values for association by considering techniques to reduce multiplicity by estimating the "effective number of tests" from the correlation structure. An algorithm that bundles all these aspects is provided, efficient computer implementations are made available, a small-scale simulation study is presented, and two real data examples are shown.
{"title":"How to analyze many contingency tables simultaneously in genetic association studies.","authors":"Thorsten Dickhaus, Klaus Straßburger, Daniel Schunk, Carlos Morcillo-Suarez, Thomas Illig, Arcadi Navarro","doi":"10.1515/1544-6115.1776","DOIUrl":"https://doi.org/10.1515/1544-6115.1776","url":null,"abstract":"<p><p>We study exact tests for (2 x 2) and (2 x 3) contingency tables, in particular exact chi-squared tests and exact tests of Fisher type. In practice, these tests are typically carried out without randomization, leading to reproducible results but not exhausting the significance level. We discuss that this can lead to methodological and practical issues in a multiple testing framework when many tables are simultaneously under consideration as in genetic association studies.Realized randomized p-values are proposed as a solution which is especially useful for data-adaptive (plug-in) procedures. These p-values allow to estimate the proportion of true null hypotheses much more accurately than their non-randomized counterparts. Moreover, we address the problem of positively correlated p-values for association by considering techniques to reduce multiplicity by estimating the \"effective number of tests\" from the correlation structure.An algorithm is provided that bundles all these aspects, efficient computer implementations are made available, a small-scale simulation study is presented and two real data examples are shown.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 4","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1776","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30802134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Incorporating the empirical null hypothesis into the Benjamini-Hochberg procedure
Debashis Ghosh
For the problem of multiple testing, the Benjamini-Hochberg (B-H) procedure has become a very popular method in applications. We show how the B-H procedure can be interpreted as a test based on the spacings of the p-value distribution. This interpretation leads to the incorporation of the empirical null hypothesis, a term coined by Efron (2004). We develop a mixture modelling approach to the empirical null hypothesis for the B-H procedure and demonstrate theoretical results regarding both finite-sample and asymptotic control of the false discovery rate. The methodology is illustrated with applications to two high-throughput datasets as well as to simulated data.
{"title":"Incorporating the empirical null hypothesis into the Benjamini-Hochberg procedure.","authors":"Debashis Ghosh","doi":"10.1515/1544-6115.1735","DOIUrl":"https://doi.org/10.1515/1544-6115.1735","url":null,"abstract":"<p><p>For the problem of multiple testing, the Benjamini-Hochberg (B-H) procedure has become a very popular method in applications. We show how the B-H procedure can be interpreted as a test based on the spacings corresponding to the p-value distributions. This interpretation leads to the incorporation of the empirical null hypothesis, a term coined by Efron (2004). We develop a mixture modelling approach for the empirical null hypothesis for the B-H procedure and demonstrate some theoretical results regarding both finite-sample as well as asymptotic control of the false discovery rate. The methodology is illustrated with application to two high-throughput datasets as well as to simulated data.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 4","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1735","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30802138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Estimating the number of one-step beneficial mutations
Andrzej J Wojtowicz, Craig R Miller, Paul Joyce
Mutations that confer a selective advantage to an organism are the raw material upon which natural selection acts. The number of such mutations that are available is a central quantity of interest for understanding the tempo and trajectory of adaptive evolution. While this quantity is typically unknown, it can be estimated with varying levels of accuracy from experimental data. We propose a method for estimating the number of beneficial mutations that accounts for the evolutionary forces that generate the data. Our model-based parametric approach is compared to an adjusted nonparametric abundance-based coverage estimator. We show that, in general, our estimator performs better; when the number of mutations is small, however, the performance of the two estimators is similar.
{"title":"Estimating the number of one-step beneficial mutations.","authors":"Andrzej J Wojtowicz, Craig R Miller, Paul Joyce","doi":"10.1515/1544-6115.1788","DOIUrl":"https://doi.org/10.1515/1544-6115.1788","url":null,"abstract":"<p><p>Mutations that confer a selective advantage to an organism are the raw material upon which natural selection acts. The number of such mutations that are available is a central quantity of interest for understanding the tempo and trajectory of adaptive evolution. While this quantity is typically unknown, it can be estimated with varying levels of accuracy based on data obtained experimentally. We propose a method for estimating the number of beneficial mutations that accounts for the evolutionary forces that generate the data. Our model-based parametric approach is compared to an adjusted nonparametric abundance-based coverage estimator. We show that, in general, our estimator performs better. When the number of mutations is small, however, the performances of the two estimators are similar.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 4","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1788","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30802133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Testing clonality of three and more tumors using their loss of heterozygosity profiles
Irina Ostrovnaya
Cancer patients often develop multiple malignancies that may be either metastatic spread of a previous cancer (clonal tumors) or new primary cancers (independent tumors). If the diagnosis cannot easily be made on the basis of the pathology review, the patterns of somatic mutations in the tumors can be compared. We previously developed statistical methods for testing the clonality of two tumors using their loss of heterozygosity (LOH) profiles at several candidate markers. These methods can be applied to all possible pairs of tumors when multiple tumors are analyzed, but this strategy can lead to inconsistent results and loss of statistical power. In this work we extend the clonality tests to three or more malignancies from the same patient. A non-parametric test can be performed using any possible subset of tumors, with subsequent adjustment for multiple testing. A parametric likelihood model is developed for three or four tumors, and it can be used to estimate the phylogenetic tree of the tumors. The proposed tests are more powerful than the combination of all possible pairwise tests.
{"title":"Testing clonality of three and more tumors using their loss of heterozygosity profiles.","authors":"Irina Ostrovnaya","doi":"10.1515/1544-6115.1757","DOIUrl":"https://doi.org/10.1515/1544-6115.1757","url":null,"abstract":"<p><p>Cancer patients often develop multiple malignancies that may be either metastatic spread of a previous cancer (clonal tumors) or new primary cancers (independent tumors). If diagnosis cannot be easily made on the basis of the pathology review, the patterns of somatic mutations in the tumors can be compared. Previously we have developed statistical methods for testing clonality of two tumors using their loss of heterozygosity (LOH) profiles at several candidate markers. These methods can be applied to all possible pairs of tumors when multiple tumors are analyzed, but this strategy can lead to inconsistent results and loss of statistical power. In this work we will extend clonality tests to three and more malignancies from the same patient. A non-parametric test can be performed using any possible subset of tumors, with the subsequent adjustment for multiple testing. A parametric likelihood model is developed for 3 or 4 tumors, and it can be used to estimate the phylogenetic tree of tumors. The proposed tests are more powerful than combination of all possible pairwise tests.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 4","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1757","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30801577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}