Despite the importance of binary traits that change over time in biology and biomedicine, their genetic mapping has not been well explored. In this article, we develop a statistical model for mapping quantitative trait loci (QTLs) that govern longitudinal responses of binary traits. The model is constructed within the maximum likelihood framework, in which the association between binary responses is modeled in terms of conditional log odds-ratios. With this parameterization, the maximum likelihood estimates (MLEs) of the marginal mean parameters are robust to misspecification of the time dependence. We implement an iterative procedure to obtain the MLEs of the QTL genotype-specific parameters that define the longitudinal binary responses. The usefulness of the model was validated by analyzing a real example in rice. Simulation studies were performed to investigate the statistical properties of the model, showing that it has power to identify and map specific QTLs responsible for the temporal pattern of binary traits.
{"title":"A maximum likelihood approach to functional mapping of longitudinal binary traits.","authors":"Chenguang Wang, Hongying Li, Zhong Wang, Yaqun Wang, Ningtao Wang, Zuoheng Wang, Rongling Wu","doi":"10.1515/1544-6115.1675","DOIUrl":"https://doi.org/10.1515/1544-6115.1675","url":null,"abstract":"<p><p>Despite their importance in biology and biomedicine, genetic mapping of binary traits that change over time has not been well explored. In this article, we develop a statistical model for mapping quantitative trait loci (QTLs) that govern longitudinal responses of binary traits. The model is constructed within the maximum likelihood framework by which the association between binary responses is modeled in terms of conditional log odds-ratios. With this parameterization, the maximum likelihood estimates (MLEs) of marginal mean parameters are robust to the misspecification of time dependence. We implement an iterative procedures to obtain the MLEs of QTL genotype-specific parameters that define longitudinal binary responses. The usefulness of the model was validated by analyzing a real example in rice. Simulation studies were performed to investigate the statistical properties of the model, showing that the model has power to identify and map specific QTLs responsible for the temporal pattern of binary traits.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 6","pages":"Article 2"},"PeriodicalIF":0.9,"publicationDate":"2012-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1675","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31076958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Microarray data can be used to identify prognostic signatures based on time-to-event data. The analysis of microarray data is often associated with overfitting, an issue that many papers have addressed. However, little attention has been paid to incomplete time-to-event data (truncated and censored follow-up). We have adapted the 0.632+ bootstrap estimator for the evaluation of time-dependent ROC curves. The interpretation of ROC-based results is well established in the scientific and medical community. Moreover, the results do not depend on the incidence of the event, in contrast to many other prognostic statistics. We have tested this methodology by simulation and illustrated its utility by analyzing a data set of diffuse large-B-cell lymphoma patients. Our results demonstrate that the 0.632+ ROC-based approach is well suited to evaluating the true prognostic capacity of a microarray-based signature. The method is implemented in the R package ROCt632.
{"title":"Time dependent ROC curves for the estimation of true prognostic capacity of microarray data.","authors":"Yohann Foucher, Richard Danger","doi":"10.1515/1544-6115.1815","DOIUrl":"https://doi.org/10.1515/1544-6115.1815","url":null,"abstract":"<p><p>Microarray data can be used to identify prognostic signatures based on time-to-event data. The analysis of microarrays is often associated with overfitting and many papers have dealt with this issue. However, little attention has been paid to incomplete time-to-event data (truncated and censored follow-up). We have adapted the 0.632+ bootstrap estimator for the evaluation of time-dependent ROC curves. The interpretation of ROC-based results is well-established among the scientific and medical community. Moreover, the results do not depend on the incidence of the event, as opposed to many other prognostic statistics. Here, we have tested this methodology by simulations. We have illustrated its utility by analyzing a data set of diffuse large-B-cell lymphoma patients. Our results demonstrate the well-adapted properties of the 0.632+ ROC-based approach to evaluate the true prognostic capacity of a microarray-based signature. This method has been implemented in an R package ROCt632.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 6","pages":"Article 1"},"PeriodicalIF":0.9,"publicationDate":"2012-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1815","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31076959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advances in genotyping that allow tens of thousands of individuals to be genotyped at a moderate number of single nucleotide polymorphisms (SNPs) permit parentage inference to be pursued on a very large scale. The intergenerational tagging this capacity allows is revolutionizing the management of cultured organisms (cows, salmon, etc.) and is poised to do the same for scientific studies of natural populations. Currently, however, there are no likelihood-based methods of parentage inference which are implemented in a manner that allows them to quickly handle a very large number of potential parents or parent pairs. Here we introduce an efficient likelihood-based method applicable to the specialized case of cultured organisms in which both parents can be reliably sampled. We develop a Markov chain representation for the cumulative number of Mendelian incompatibilities between an offspring and its putative parents and we exploit it to develop a fast algorithm for simulation-based estimates of statistical confidence in SNP-based assignments of offspring to pairs of parents. The method is implemented in the freely available software SNPPIT. We describe the method in detail, then assess its performance in a large simulation study using known allele frequencies at 96 SNPs from ten hatchery salmon populations. The simulations verify that the method is fast and accurate and that 96 well-chosen SNPs can provide sufficient power to identify the correct pair of parents from amongst millions of candidate pairs.
{"title":"Large-scale parentage inference with SNPs: an efficient algorithm for statistical confidence of parent pair allocations.","authors":"Eric C Anderson","doi":"10.1515/1544-6115.1833","DOIUrl":"https://doi.org/10.1515/1544-6115.1833","url":null,"abstract":"<p><p>Advances in genotyping that allow tens of thousands of individuals to be genotyped at a moderate number of single nucleotide polymorphisms (SNPs) permit parentage inference to be pursued on a very large scale. The intergenerational tagging this capacity allows is revolutionizing the management of cultured organisms (cows, salmon, etc.) and is poised to do the same for scientific studies of natural populations. Currently, however, there are no likelihood-based methods of parentage inference which are implemented in a manner that allows them to quickly handle a very large number of potential parents or parent pairs. Here we introduce an efficient likelihood-based method applicable to the specialized case of cultured organisms in which both parents can be reliably sampled. We develop a Markov chain representation for the cumulative number of Mendelian incompatibilities between an offspring and its putative parents and we exploit it to develop a fast algorithm for simulation-based estimates of statistical confidence in SNP-based assignments of offspring to pairs of parents. The method is implemented in the freely available software SNPPIT. We describe the method in detail, then assess its performance in a large simulation study using known allele frequencies at 96 SNPs from ten hatchery salmon populations. The simulations verify that the method is fast and accurate and that 96 well-chosen SNPs can provide sufficient power to identify the correct pair of parents from amongst millions of candidate pairs.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1833","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31050246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The aim of this paper is to propose a test procedure for the detection of differential alternative splicing across conditions in tiling array or exon chip data. While developed in a mixed-model framework, the test procedure is exact (avoiding computational burden) and applicable to a large variety of contrasts, including several previously published ones. A simulation study is presented to evaluate the robustness and performance of the method, which is found to have good power to detect genes under differential alternative splicing, even with only five biological replicates and four probes per exon. The methodology also enables the comparison of various experimental designs through exact power curves, illustrated here with a comparison of paired and unpaired experiments. The test procedure was applied to two publicly available exon-array cancer data sets and showed promising results.
{"title":"ExactDAS: an exact test procedure for the detection of differential alternative splicing in microarray experiments.","authors":"Tristan Mary-Huard, Florence Jaffrezic, Stéphane Robin","doi":"10.1515/1544-6115.1814","DOIUrl":"https://doi.org/10.1515/1544-6115.1814","url":null,"abstract":"<p><p>The aim of this paper is to propose a test procedure for the detection of differential alternative splicing across conditions for tiling array or exon chip data. While developed in a mixed model framework, the test procedure is exact (avoiding computational burden) and applicable to a large variety of contrasts, including several previously published ones. A simulation study is presented to evaluate the robustness and performance of the method. It is found to have a good detection power of genes under differential alternative splicing, even for five biological replicates and four probes per exon. The methodology also enables the comparison of various experimental designs through exact power curves. This is illustrated with the comparison of paired and unpaired experiments. The test procedure was applied to two publicly available cancer data sets based on exon arrays, and showed promising results.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1814","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31050245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, microarrays that can simultaneously measure the expression levels of thousands of genes have become a valuable tool for classifying tumors. For such classification, where the sample size is usually much smaller than the number of genes, it is essential to construct properly sparse models that predict tumor types accurately without over-fitting. Bayesian shrinkage estimation is considered a suitable method for providing such sparse models, effectively shrinking the estimated effects of many irrelevant genes to zero while maintaining those of a small number of relevant genes at significant magnitudes. However, Bayesian analysis usually requires computationally intensive techniques such as MCMC iterations. This paper describes a computationally efficient method of Bayesian shrinkage regression (BSR) incorporating multiple hierarchical structures for constructing a classification model for tumor types from microarray gene expression data. We use a variational approximation, which provides simple approximations of the posterior distributions of the parameters, to reduce the computational burden of the Bayesian estimation. The resulting BSR procedure yields a properly sparse model for accurately and rapidly classifying tumor samples. On simulated and real gene expression data sets, its classification accuracy is at least equivalent to that of other methods such as support vector machines and partial least squares.
{"title":"Variational Bayes procedure for effective classification of tumor type with microarray gene expression data.","authors":"Takeshi Hayashi","doi":"10.1515/1544-6115.1700","DOIUrl":"https://doi.org/10.1515/1544-6115.1700","url":null,"abstract":"<p><p>Recently, microarrays that can simultaneously measure the expression levels of thousands of genes have become a valuable tool for classifying tumors. For such classification, where the sample size is usually much smaller than the number of genes, it is essential to construct properly sparse models for accurately predicting tumor types to avoid over-fitting. Bayesian shrinkage estimation is considered a suitable method for providing such sparse models, effectively shrinking estimates of the effects for many irrelevant genes to zero while maintaining those of a small number of relevant genes at significant magnitudes. However, Bayesian analysis usually requires time-consuming computational techniques such as computationally intensive MCMC iterations. This paper describes a computationally effective method of Bayesian shrinkage regression (BSR) incorporating multiple hierarchical structures for constructing a classification model for tumor types using microarray gene expression data. We use a variational approximation method which provides simple approximations of posterior distributions of parameters to reduce computational burden in the Bayesian estimation. This computationally efficient BSR procedure yields a properly sparse model for accurately and rapidly classifying tumor samples. The accuracy of tumor classification is shown to be at least equivalent to that of other methods such as support vector machine and partial least squares using simulated and actual gene expression data sets.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":"Article 9"},"PeriodicalIF":0.9,"publicationDate":"2012-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1700","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31017509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Next-generation sequencing technology provides a powerful tool for measuring gene expression (mRNA) levels in the form of RNA-sequence data. Method development for identifying differentially expressed (DE) genes from RNA-seq data, which frequently include many low-count integers and can exhibit severe overdispersion relative to Poisson or binomial distributions, is a popular area of ongoing research. Here we present quasi-likelihood methods with shrunken dispersion estimates based on an adaptation of Smyth's (2004) approach to estimating gene-specific error variances for microarray data. Our suggested methods are computationally simple, analogous to ANOVA, and compare favorably with competing methods in detecting DE genes and estimating false discovery rates across a variety of simulations based on real data.
{"title":"Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates.","authors":"Steven P Lund, Dan Nettleton, Davis J McCarthy, Gordon K Smyth","doi":"10.1515/1544-6115.1826","DOIUrl":"https://doi.org/10.1515/1544-6115.1826","url":null,"abstract":"<p><p>Next generation sequencing technology provides a powerful tool for measuring gene expression (mRNA) levels in the form of RNA-sequence data. Method development for identifying differentially expressed (DE) genes from RNA-seq data, which frequently includes many low-count integers and can exhibit severe overdispersion relative to Poisson or binomial distributions, is a popular area of ongoing research. Here we present quasi-likelihood methods with shrunken dispersion estimates based on an adaptation of Smyth's (2004) approach to estimating gene-specific error variances for microarray data. Our suggested methods are computationally simple, analogous to ANOVA and compare favorably versus competing methods in detecting DE genes and estimating false discovery rates across a variety of simulations based on real data.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1826","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31008005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Propensity scores are commonly used to address confounding in observational studies. However, they have not previously been adapted to deal with bias in genetic association studies. We propose an extension of our previous method (Zhao et al., 2009) that uses a multilevel propensity score approach to estimate the effect of a genotype under an additive model while simultaneously adjusting for confounders such as genetic ancestry and patient and disease characteristics. Using simulation studies, we demonstrate that this extended genetic propensity score (eGPS) adequately and consistently corrects for bias due to confounding in a variety of circumstances. Under all simulation scenarios, the eGPS method yields estimates with bias close to 0 (mean = 0.018, standard error = 0.01). Our method also preserves statistical properties such as coverage probability, Type I error, and power. We illustrate this approach in a population-based genetic association study of testicular germ cell tumors and the KITLG and SPRY4 susceptibility genes. We conclude that our method provides a novel and broadly applicable analytic strategy for obtaining less biased and more valid estimates of genetic associations.
{"title":"Analyzing genetic association studies with an extended propensity score approach.","authors":"Huaqing Zhao, Timothy R Rebbeck, Nandita Mitra","doi":"10.1515/1544-6115.1790","DOIUrl":"https://doi.org/10.1515/1544-6115.1790","url":null,"abstract":"<p><p>Propensity scores are commonly used to address confounding in observational studies. However, they have not been previously adapted to deal with bias in genetic association studies. We propose an extension of our previous method (Zhao et al., 2009) that uses a multilevel propensity score approach and allows one to estimate the effect of a genotype under an additive model and also simultaneously adjusts for confounders such as genetic ancestry and patient and disease characteristics. Using simulation studies, we demonstrate that this extended genetic propensity score (eGPS) can adequately adjust and consistently correct for bias due to confounding in a variety of circumstances. Under all simulation scenarios, the eGPS method yields estimates with bias close to 0 (mean=0.018, standard error=0.01). Our method also preserves statistical properties such as coverage probability, Type I error, and power. We illustrate this approach in a population-based genetic association study of testicular germ cell tumors and KITLG and SPRY4 susceptibility genes. We conclude that our method provides a novel and broadly applicable analytic strategy for obtaining less biased and more valid estimates of genetic associations.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1790","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31006389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Differential expression analysis of sequence-count expression data involves performing a large number of hypothesis tests that compare the expression count data of each gene or transcript across two or more biological conditions. The assumptions of any specific hypothesis-testing method will probably not be valid for each of a very large number of genes. Thus, computational evaluation of assumptions should be incorporated into the analysis to select an appropriate hypothesis-testing method for each gene. Here, we generalize earlier work to introduce two novel procedures that use estimates of the empirical Bayesian probability (EBP) of overdispersion to select or combine results of a standard Poisson likelihood ratio test and a quasi-likelihood test for each gene. These EBP-based procedures simultaneously evaluate the Poisson-distribution assumption and account for multiple testing. With adequate power to detect overdispersion, the new procedures select the standard likelihood test for each gene with Poisson-distributed counts and the quasi-likelihood test for each gene with overdispersed counts. The new procedures outperformed previously published methods in many simulation studies. We also present a real-data analysis example and discuss how the framework used to develop the new procedures may be generalized to further enhance performance. An R code library that implements the methods is freely available at www.stjuderesearch.org/depts/biostats/software.
{"title":"Empirical bayesian selection of hypothesis testing procedures for analysis of sequence count expression data.","authors":"Stanley B Pounds, Cuilan L Gao, Hui Zhang","doi":"10.1515/1544-6115.1773","DOIUrl":"https://doi.org/10.1515/1544-6115.1773","url":null,"abstract":"<p><p>Differential expression analysis of sequence-count expression data involves performing a large number of hypothesis tests that compare the expression count data of each gene or transcript across two or more biological conditions. The assumptions of any specific hypothesis-testing method will probably not be valid for each of a very large number of genes. Thus, computational evaluation of assumptions should be incorporated into the analysis to select an appropriate hypothesis-testing method for each gene. Here, we generalize earlier work to introduce two novel procedures that use estimates of the empirical Bayesian probability (EBP) of overdispersion to select or combine results of a standard Poisson likelihood ratio test and a quasi-likelihood test for each gene. These EBP-based procedures simultaneously evaluate the Poisson-distribution assumption and account for multiple testing. With adequate power to detect overdispersion, the new procedures select the standard likelihood test for each gene with Poisson-distributed counts and the quasi-likelihood test for each gene with overdispersed counts. The new procedures outperformed previously published methods in many simulation studies. We also present a real-data analysis example and discuss how the framework used to develop the new procedures may be generalized to further enhance performance. An R code library that implements the methods is freely available at www.stjuderesearch.org/depts/biostats/software.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1773","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31008004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Histogram-based empirical Bayes methods developed for analyzing data for large numbers of genes, SNPs, or other biological features tend to have large biases when applied to data with a smaller number of features such as genes with expression measured conventionally, proteins, and metabolites. To analyze such small-scale and medium-scale data in an empirical Bayes framework, we introduce corrections of maximum likelihood estimators (MLEs) of the local false discovery rate (LFDR). In this context, the MLE estimates the LFDR, which is a posterior probability of null hypothesis truth, by estimating the prior distribution. The corrections lie in excluding each feature when estimating one or more parameters on which the prior depends. In addition, we propose the expected LFDR (ELFDR) in order to propagate the uncertainty involved in estimating the prior. We also introduce an optimally weighted combination of the best of the corrected MLEs with a previous estimator that, being based on a binomial distribution, does not require a parametric model of the data distribution across features. An application of the new estimators and previous estimators to protein abundance data illustrates the extent to which different estimators lead to different conclusions about which proteins are affected by cancer. A simulation study was conducted to approximate the bias of the new estimators relative to previous LFDR estimators. Data were simulated for two different numbers of features (N), two different noncentrality parameter values or detectability levels (dalt), and several proportions of unaffected features (p0). One of these previous estimators is a histogram-based estimator (HBE) designed for a large number of features. The simulations show that some of the corrected MLEs and the ELFDR that corrects the HBE reduce the negative bias relative to the MLE and the HBE, respectively. For every method, we defined the worst-case performance as the maximum of the absolute value of the bias over the two different dalt and over various p0. The best worst-case methods represent the safest methods to be used under given conditions. This analysis indicates that the binomial-based method has the lowest worst-case absolute bias for high p0 and for N = 3, 12. However, the corrected MLE that is based on the minimum description length (MDL) principle is the best worst-case method when the value of p0 is more uncertain since it has one of the lowest worst-case biases over all possible values of p0 and for N = 3, 12. Therefore, the safest estimator considered is the binomial-based method when a high proportion of unaffected features can be assumed and the MDL-based method otherwise. A second simulation study was conducted with additional values of N. We found that HBE requires N to be at least 6-12 features to perform as well as the estimators proposed here, with the precise minimum N depending on p0 and dalt.
{"title":"Estimators of the local false discovery rate designed for small numbers of tests.","authors":"Marta Padilla, David R Bickel","doi":"10.1515/1544-6115.1807","DOIUrl":"https://doi.org/10.1515/1544-6115.1807","url":null,"abstract":"<p><p>Histogram-based empirical Bayes methods developed for analyzing data for large numbers of genes, SNPs, or other biological features tend to have large biases when applied to data with a smaller number of features such as genes with expression measured conventionally, proteins, and metabolites. To analyze such small-scale and medium-scale data in an empirical Bayes framework, we introduce corrections of maximum likelihood estimators (MLEs) of the local false discovery rate (LFDR). In this context, the MLE estimates the LFDR, which is a posterior probability of null hypothesis truth, by estimating the prior distribution. The corrections lie in excluding each feature when estimating one or more parameters on which the prior depends. In addition, we propose the expected LFDR (ELFDR) in order to propagate the uncertainty involved in estimating the prior. We also introduce an optimally weighted combination of the best of the corrected MLEs with a previous estimator that, being based on a binomial distribution, does not require a parametric model of the data distribution across features. An application of the new estimators and previous estimators to protein abundance data illustrates the extent to which different estimators lead to different conclusions about which proteins are affected by cancer. A simulation study was conducted to approximate the bias of the new estimators relative to previous LFDR estimators. Data were simulated for two different numbers of features (N), two different noncentrality parameter values or detectability levels (dalt), and several proportions of unaffected features (p0). One of these previous estimators is a histogram-based estimator (HBE) designed for a large number of features. The simulations show that some of the corrected MLEs and the ELFDR that corrects the HBE reduce the negative bias relative to the MLE and the HBE, respectively. For every method, we defined the worst-case performance as the maximum of the absolute value of the bias over the two different dalt and over various p0. The best worst-case methods represent the safest methods to be used under given conditions. This analysis indicates that the binomial-based method has the lowest worst-case absolute bias for high p0 and for N = 3, 12. However, the corrected MLE that is based on the minimum description length (MDL) principle is the best worst-case method when the value of p0 is more uncertain since it has one of the lowest worst-case biases over all possible values of p0 and for N = 3, 12. Therefore, the safest estimator considered is the binomial-based method when a high proportion of unaffected features can be assumed and the MDL-based method otherwise. A second simulation study was conducted with additional values of N. 
We found that HBE requires N to be at least 6-12 features to perform as well as the estimators proposed here, with the precise minimum N depending on p0 and dalt.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":"4"},"PeriodicalIF":0.9,"publicationDate":"2012-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1807","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30988559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
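A minimal sketch of the corrected-MLE idea under a deliberately simplified model: z-statistics follow p0*N(0,1) + (1-p0)*N(dalt,1), the prior parameters are re-estimated by maximum likelihood with the focal feature left out, and the LFDR is the resulting posterior null probability. The estimators in the paper (including the MDL-based correction and the ELFDR) are more involved.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(4)
z = np.concatenate([rng.normal(0, 1, 8), rng.normal(3, 1, 4)])   # 12 features

def neg_loglik(theta, zs):
    p0, dalt = theta
    dens = p0 * stats.norm.pdf(zs, 0, 1) + (1 - p0) * stats.norm.pdf(zs, dalt, 1)
    return -np.sum(np.log(dens))

def corrected_mle_lfdr(z):
    """Leave-one-out ('corrected') MLE of the LFDR for each feature."""
    lfdr = np.empty_like(z)
    for i in range(len(z)):
        others = np.delete(z, i)                 # exclude the focal feature
        res = optimize.minimize(neg_loglik, x0=[0.8, 2.0], args=(others,),
                                bounds=[(0.01, 0.99), (0.1, 10.0)])
        p0, dalt = res.x
        f0 = p0 * stats.norm.pdf(z[i], 0, 1)
        f1 = (1 - p0) * stats.norm.pdf(z[i], dalt, 1)
        lfdr[i] = f0 / (f0 + f1)
    return lfdr

print(np.round(corrected_mle_lfdr(z), 3))
```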
Copy number variations (CNVs) are important in disease association studies and are targeted by most recent microarray platforms developed for GWAS. However, the probes targeting the same CNV region can vary greatly in performance, with some probes carrying little more information than pure noise. In this paper, we investigate how to best combine measurements from multiple probes to estimate the copy numbers of individuals under the framework of the Gaussian mixture model (GMM). First, we show that under two regularity conditions, and assuming all parameters except the mixing proportions are known, optimal weights can be obtained so that the univariate GMM based on the weighted average gives exactly the same classification as the multivariate GMM. We then develop an algorithm that iteratively estimates the parameters, obtains the optimal weights, and uses them for classification. The algorithm performs well on simulated data and on two real data sets, showing a clear advantage over classification based on the equally weighted average.
{"title":"Genotype copy number variations using Gaussian mixture models: theory and algorithms.","authors":"Chang-Yun Lin, Yungtai Lo, Kenny Q Ye","doi":"10.1515/1544-6115.1725","DOIUrl":"https://doi.org/10.1515/1544-6115.1725","url":null,"abstract":"<p><p>Copy number variations (CNVs) are important in the disease association studies and are usually targeted by most recent microarray platforms developed for GWAS studies. However, the probes targeting the same CNV regions could vary greatly in performance, with some of the probes carrying little information more than pure noise. In this paper, we investigate how to best combine measurements of multiple probes to estimate copy numbers of individuals under the framework of Gaussian mixture model (GMM). First we show that under two regularity conditions and assume all the parameters except the mixing proportions are known, optimal weights can be obtained so that the univariate GMM based on the weighted average gives the exactly the same classification as the multivariate GMM does. We then developed an algorithm that iteratively estimates the parameters and obtains the optimal weights, and uses them for classification. The algorithm performs well on simulation data and two sets of real data, which shows clear advantage over classification based on the equal weighted average.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":"5"},"PeriodicalIF":0.9,"publicationDate":"2012-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1725","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30988558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}