Despite the importance of binary traits that change over time in biology and biomedicine, their genetic mapping has not been well explored. In this article, we develop a statistical model for mapping quantitative trait loci (QTLs) that govern longitudinal responses of binary traits. The model is constructed within the maximum likelihood framework, in which the association between binary responses is modeled in terms of conditional log odds-ratios. With this parameterization, the maximum likelihood estimates (MLEs) of the marginal mean parameters are robust to misspecification of the time dependence. We implement an iterative procedure to obtain the MLEs of the QTL genotype-specific parameters that define the longitudinal binary responses. The usefulness of the model was validated by analyzing a real example in rice. Simulation studies were performed to investigate the statistical properties of the model, showing that it has power to identify and map specific QTLs responsible for the temporal pattern of binary traits.
{"title":"A maximum likelihood approach to functional mapping of longitudinal binary traits.","authors":"Chenguang Wang, Hongying Li, Zhong Wang, Yaqun Wang, Ningtao Wang, Zuoheng Wang, Rongling Wu","doi":"10.1515/1544-6115.1675","DOIUrl":"https://doi.org/10.1515/1544-6115.1675","url":null,"abstract":"<p><p>Despite their importance in biology and biomedicine, genetic mapping of binary traits that change over time has not been well explored. In this article, we develop a statistical model for mapping quantitative trait loci (QTLs) that govern longitudinal responses of binary traits. The model is constructed within the maximum likelihood framework by which the association between binary responses is modeled in terms of conditional log odds-ratios. With this parameterization, the maximum likelihood estimates (MLEs) of marginal mean parameters are robust to the misspecification of time dependence. We implement an iterative procedures to obtain the MLEs of QTL genotype-specific parameters that define longitudinal binary responses. The usefulness of the model was validated by analyzing a real example in rice. Simulation studies were performed to investigate the statistical properties of the model, showing that the model has power to identify and map specific QTLs responsible for the temporal pattern of binary traits.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 6","pages":"Article 2"},"PeriodicalIF":0.9,"publicationDate":"2012-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1675","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31076958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Microarray data can be used to identify prognostic signatures based on time-to-event data. The analysis of microarray data is often associated with overfitting, an issue that many papers have addressed. However, little attention has been paid to incomplete time-to-event data (truncated and censored follow-up). We have adapted the 0.632+ bootstrap estimator for the evaluation of time-dependent ROC curves. The interpretation of ROC-based results is well established in the scientific and medical community. Moreover, the results do not depend on the incidence of the event, in contrast to many other prognostic statistics. We have tested this methodology by simulation and illustrated its utility by analyzing a data set of diffuse large-B-cell lymphoma patients. Our results demonstrate that the 0.632+ ROC-based approach is well suited to evaluating the true prognostic capacity of a microarray-based signature. The method is implemented in the R package ROCt632.
{"title":"Time dependent ROC curves for the estimation of true prognostic capacity of microarray data.","authors":"Yohann Foucher, Richard Danger","doi":"10.1515/1544-6115.1815","DOIUrl":"https://doi.org/10.1515/1544-6115.1815","url":null,"abstract":"<p><p>Microarray data can be used to identify prognostic signatures based on time-to-event data. The analysis of microarrays is often associated with overfitting and many papers have dealt with this issue. However, little attention has been paid to incomplete time-to-event data (truncated and censored follow-up). We have adapted the 0.632+ bootstrap estimator for the evaluation of time-dependent ROC curves. The interpretation of ROC-based results is well-established among the scientific and medical community. Moreover, the results do not depend on the incidence of the event, as opposed to many other prognostic statistics. Here, we have tested this methodology by simulations. We have illustrated its utility by analyzing a data set of diffuse large-B-cell lymphoma patients. Our results demonstrate the well-adapted properties of the 0.632+ ROC-based approach to evaluate the true prognostic capacity of a microarray-based signature. This method has been implemented in an R package ROCt632.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 6","pages":"Article 1"},"PeriodicalIF":0.9,"publicationDate":"2012-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1815","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31076959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advances in genotyping that allow tens of thousands of individuals to be genotyped at a moderate number of single nucleotide polymorphisms (SNPs) permit parentage inference to be pursued on a very large scale. The intergenerational tagging this capacity allows is revolutionizing the management of cultured organisms (cows, salmon, etc.) and is poised to do the same for scientific studies of natural populations. Currently, however, there are no likelihood-based methods of parentage inference which are implemented in a manner that allows them to quickly handle a very large number of potential parents or parent pairs. Here we introduce an efficient likelihood-based method applicable to the specialized case of cultured organisms in which both parents can be reliably sampled. We develop a Markov chain representation for the cumulative number of Mendelian incompatibilities between an offspring and its putative parents and we exploit it to develop a fast algorithm for simulation-based estimates of statistical confidence in SNP-based assignments of offspring to pairs of parents. The method is implemented in the freely available software SNPPIT. We describe the method in detail, then assess its performance in a large simulation study using known allele frequencies at 96 SNPs from ten hatchery salmon populations. The simulations verify that the method is fast and accurate and that 96 well-chosen SNPs can provide sufficient power to identify the correct pair of parents from amongst millions of candidate pairs.
{"title":"Large-scale parentage inference with SNPs: an efficient algorithm for statistical confidence of parent pair allocations.","authors":"Eric C Anderson","doi":"10.1515/1544-6115.1833","DOIUrl":"https://doi.org/10.1515/1544-6115.1833","url":null,"abstract":"<p><p>Advances in genotyping that allow tens of thousands of individuals to be genotyped at a moderate number of single nucleotide polymorphisms (SNPs) permit parentage inference to be pursued on a very large scale. The intergenerational tagging this capacity allows is revolutionizing the management of cultured organisms (cows, salmon, etc.) and is poised to do the same for scientific studies of natural populations. Currently, however, there are no likelihood-based methods of parentage inference which are implemented in a manner that allows them to quickly handle a very large number of potential parents or parent pairs. Here we introduce an efficient likelihood-based method applicable to the specialized case of cultured organisms in which both parents can be reliably sampled. We develop a Markov chain representation for the cumulative number of Mendelian incompatibilities between an offspring and its putative parents and we exploit it to develop a fast algorithm for simulation-based estimates of statistical confidence in SNP-based assignments of offspring to pairs of parents. The method is implemented in the freely available software SNPPIT. We describe the method in detail, then assess its performance in a large simulation study using known allele frequencies at 96 SNPs from ten hatchery salmon populations. The simulations verify that the method is fast and accurate and that 96 well-chosen SNPs can provide sufficient power to identify the correct pair of parents from amongst millions of candidate pairs.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1833","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31050246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The aim of this paper is to propose a test procedure for the detection of differential alternative splicing across conditions in tiling array or exon chip data. While developed in a mixed-model framework, the test procedure is exact (avoiding computational burden) and applicable to a large variety of contrasts, including several previously published ones. A simulation study is presented to evaluate the robustness and performance of the method, which is found to have good power to detect genes under differential alternative splicing, even with only five biological replicates and four probes per exon. The methodology also enables the comparison of various experimental designs through exact power curves, illustrated here with a comparison of paired and unpaired experiments. The test procedure was applied to two publicly available exon-array cancer data sets and showed promising results.
{"title":"ExactDAS: an exact test procedure for the detection of differential alternative splicing in microarray experiments.","authors":"Tristan Mary-Huard, Florence Jaffrezic, Stéphane Robin","doi":"10.1515/1544-6115.1814","DOIUrl":"https://doi.org/10.1515/1544-6115.1814","url":null,"abstract":"<p><p>The aim of this paper is to propose a test procedure for the detection of differential alternative splicing across conditions for tiling array or exon chip data. While developed in a mixed model framework, the test procedure is exact (avoiding computational burden) and applicable to a large variety of contrasts, including several previously published ones. A simulation study is presented to evaluate the robustness and performance of the method. It is found to have a good detection power of genes under differential alternative splicing, even for five biological replicates and four probes per exon. The methodology also enables the comparison of various experimental designs through exact power curves. This is illustrated with the comparison of paired and unpaired experiments. The test procedure was applied to two publicly available cancer data sets based on exon arrays, and showed promising results.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1814","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31050245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, microarrays that can simultaneously measure the expression levels of thousands of genes have become a valuable tool for classifying tumors. For such classification, where the sample size is usually much smaller than the number of genes, it is essential to construct properly sparse models that predict tumor types accurately without over-fitting. Bayesian shrinkage estimation is considered a suitable method for providing such sparse models, effectively shrinking the estimated effects of many irrelevant genes to zero while maintaining those of a small number of relevant genes at significant magnitudes. However, Bayesian analysis usually requires computationally intensive techniques such as MCMC iterations. This paper describes a computationally efficient method of Bayesian shrinkage regression (BSR) incorporating multiple hierarchical structures for constructing a classification model for tumor types from microarray gene expression data. We use a variational approximation, which provides simple approximations of the posterior distributions of the parameters, to reduce the computational burden of the Bayesian estimation. The resulting BSR procedure yields a properly sparse model for accurately and rapidly classifying tumor samples. On simulated and real gene expression data sets, its classification accuracy is at least equivalent to that of other methods such as support vector machines and partial least squares.
{"title":"Variational Bayes procedure for effective classification of tumor type with microarray gene expression data.","authors":"Takeshi Hayashi","doi":"10.1515/1544-6115.1700","DOIUrl":"https://doi.org/10.1515/1544-6115.1700","url":null,"abstract":"<p><p>Recently, microarrays that can simultaneously measure the expression levels of thousands of genes have become a valuable tool for classifying tumors. For such classification, where the sample size is usually much smaller than the number of genes, it is essential to construct properly sparse models for accurately predicting tumor types to avoid over-fitting. Bayesian shrinkage estimation is considered a suitable method for providing such sparse models, effectively shrinking estimates of the effects for many irrelevant genes to zero while maintaining those of a small number of relevant genes at significant magnitudes. However, Bayesian analysis usually requires time-consuming computational techniques such as computationally intensive MCMC iterations. This paper describes a computationally effective method of Bayesian shrinkage regression (BSR) incorporating multiple hierarchical structures for constructing a classification model for tumor types using microarray gene expression data. We use a variational approximation method which provides simple approximations of posterior distributions of parameters to reduce computational burden in the Bayesian estimation. This computationally efficient BSR procedure yields a properly sparse model for accurately and rapidly classifying tumor samples. The accuracy of tumor classification is shown to be at least equivalent to that of other methods such as support vector machine and partial least squares using simulated and actual gene expression data sets.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":"Article 9"},"PeriodicalIF":0.9,"publicationDate":"2012-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1700","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31017509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Next-generation sequencing technology provides a powerful tool for measuring gene expression (mRNA) levels in the form of RNA-sequence data. Method development for identifying differentially expressed (DE) genes from RNA-seq data, which frequently include many low-count integers and can exhibit severe overdispersion relative to Poisson or binomial distributions, is a popular area of ongoing research. Here we present quasi-likelihood methods with shrunken dispersion estimates based on an adaptation of Smyth's (2004) approach to estimating gene-specific error variances for microarray data. Our suggested methods are computationally simple, analogous to ANOVA, and compare favorably with competing methods in detecting DE genes and estimating false discovery rates across a variety of simulations based on real data.
{"title":"Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates.","authors":"Steven P Lund, Dan Nettleton, Davis J McCarthy, Gordon K Smyth","doi":"10.1515/1544-6115.1826","DOIUrl":"https://doi.org/10.1515/1544-6115.1826","url":null,"abstract":"<p><p>Next generation sequencing technology provides a powerful tool for measuring gene expression (mRNA) levels in the form of RNA-sequence data. Method development for identifying differentially expressed (DE) genes from RNA-seq data, which frequently includes many low-count integers and can exhibit severe overdispersion relative to Poisson or binomial distributions, is a popular area of ongoing research. Here we present quasi-likelihood methods with shrunken dispersion estimates based on an adaptation of Smyth's (2004) approach to estimating gene-specific error variances for microarray data. Our suggested methods are computationally simple, analogous to ANOVA and compare favorably versus competing methods in detecting DE genes and estimating false discovery rates across a variety of simulations based on real data.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1826","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31008005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Propensity scores are commonly used to address confounding in observational studies. However, they have not previously been adapted to deal with bias in genetic association studies. We propose an extension of our previous method (Zhao et al., 2009) that uses a multilevel propensity score approach to estimate the effect of a genotype under an additive model while simultaneously adjusting for confounders such as genetic ancestry and patient and disease characteristics. Using simulation studies, we demonstrate that this extended genetic propensity score (eGPS) adequately and consistently corrects for bias due to confounding in a variety of circumstances. Under all simulation scenarios, the eGPS method yields estimates with bias close to 0 (mean = 0.018, standard error = 0.01). Our method also preserves statistical properties such as coverage probability, Type I error, and power. We illustrate this approach in a population-based genetic association study of testicular germ cell tumors and the KITLG and SPRY4 susceptibility genes. We conclude that our method provides a novel and broadly applicable analytic strategy for obtaining less biased and more valid estimates of genetic associations.
{"title":"Analyzing genetic association studies with an extended propensity score approach.","authors":"Huaqing Zhao, Timothy R Rebbeck, Nandita Mitra","doi":"10.1515/1544-6115.1790","DOIUrl":"https://doi.org/10.1515/1544-6115.1790","url":null,"abstract":"<p><p>Propensity scores are commonly used to address confounding in observational studies. However, they have not been previously adapted to deal with bias in genetic association studies. We propose an extension of our previous method (Zhao et al., 2009) that uses a multilevel propensity score approach and allows one to estimate the effect of a genotype under an additive model and also simultaneously adjusts for confounders such as genetic ancestry and patient and disease characteristics. Using simulation studies, we demonstrate that this extended genetic propensity score (eGPS) can adequately adjust and consistently correct for bias due to confounding in a variety of circumstances. Under all simulation scenarios, the eGPS method yields estimates with bias close to 0 (mean=0.018, standard error=0.01). Our method also preserves statistical properties such as coverage probability, Type I error, and power. We illustrate this approach in a population-based genetic association study of testicular germ cell tumors and KITLG and SPRY4 susceptibility genes. We conclude that our method provides a novel and broadly applicable analytic strategy for obtaining less biased and more valid estimates of genetic associations.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1790","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31006389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Differential expression analysis of sequence-count expression data involves performing a large number of hypothesis tests that compare the expression count data of each gene or transcript across two or more biological conditions. The assumptions of any specific hypothesis-testing method will probably not be valid for each of a very large number of genes. Thus, computational evaluation of assumptions should be incorporated into the analysis to select an appropriate hypothesis-testing method for each gene. Here, we generalize earlier work to introduce two novel procedures that use estimates of the empirical Bayesian probability (EBP) of overdispersion to select or combine results of a standard Poisson likelihood ratio test and a quasi-likelihood test for each gene. These EBP-based procedures simultaneously evaluate the Poisson-distribution assumption and account for multiple testing. With adequate power to detect overdispersion, the new procedures select the standard likelihood test for each gene with Poisson-distributed counts and the quasi-likelihood test for each gene with overdispersed counts. The new procedures outperformed previously published methods in many simulation studies. We also present a real-data analysis example and discuss how the framework used to develop the new procedures may be generalized to further enhance performance. An R code library that implements the methods is freely available at www.stjuderesearch.org/depts/biostats/software.
{"title":"Empirical bayesian selection of hypothesis testing procedures for analysis of sequence count expression data.","authors":"Stanley B Pounds, Cuilan L Gao, Hui Zhang","doi":"10.1515/1544-6115.1773","DOIUrl":"https://doi.org/10.1515/1544-6115.1773","url":null,"abstract":"<p><p>Differential expression analysis of sequence-count expression data involves performing a large number of hypothesis tests that compare the expression count data of each gene or transcript across two or more biological conditions. The assumptions of any specific hypothesis-testing method will probably not be valid for each of a very large number of genes. Thus, computational evaluation of assumptions should be incorporated into the analysis to select an appropriate hypothesis-testing method for each gene. Here, we generalize earlier work to introduce two novel procedures that use estimates of the empirical Bayesian probability (EBP) of overdispersion to select or combine results of a standard Poisson likelihood ratio test and a quasi-likelihood test for each gene. These EBP-based procedures simultaneously evaluate the Poisson-distribution assumption and account for multiple testing. With adequate power to detect overdispersion, the new procedures select the standard likelihood test for each gene with Poisson-distributed counts and the quasi-likelihood test for each gene with overdispersed counts. The new procedures outperformed previously published methods in many simulation studies. We also present a real-data analysis example and discuss how the framework used to develop the new procedures may be generalized to further enhance performance. An R code library that implements the methods is freely available at www.stjuderesearch.org/depts/biostats/software.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1773","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31008004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Histogram-based empirical Bayes methods developed for analyzing data for large numbers of genes, SNPs, or other biological features tend to have large biases when applied to data with a smaller number of features such as genes with expression measured conventionally, proteins, and metabolites. To analyze such small-scale and medium-scale data in an empirical Bayes framework, we introduce corrections of maximum likelihood estimators (MLEs) of the local false discovery rate (LFDR). In this context, the MLE estimates the LFDR, which is a posterior probability of null hypothesis truth, by estimating the prior distribution. The corrections lie in excluding each feature when estimating one or more parameters on which the prior depends. In addition, we propose the expected LFDR (ELFDR) in order to propagate the uncertainty involved in estimating the prior. We also introduce an optimally weighted combination of the best of the corrected MLEs with a previous estimator that, being based on a binomial distribution, does not require a parametric model of the data distribution across features. An application of the new estimators and previous estimators to protein abundance data illustrates the extent to which different estimators lead to different conclusions about which proteins are affected by cancer. A simulation study was conducted to approximate the bias of the new estimators relative to previous LFDR estimators. Data were simulated for two different numbers of features (N), two different noncentrality parameter values or detectability levels (dalt), and several proportions of unaffected features (p0). One of these previous estimators is a histogram-based estimator (HBE) designed for a large number of features. The simulations show that some of the corrected MLEs and the ELFDR that corrects the HBE reduce the negative bias relative to the MLE and the HBE, respectively. For every method, we defined the worst-case performance as the maximum of the absolute value of the bias over the two different dalt and over various p0. The best worst-case methods represent the safest methods to be used under given conditions. This analysis indicates that the binomial-based method has the lowest worst-case absolute bias for high p0 and for N = 3, 12. However, the corrected MLE that is based on the minimum description length (MDL) principle is the best worst-case method when the value of p0 is more uncertain since it has one of the lowest worst-case biases over all possible values of p0 and for N = 3, 12. Therefore, the safest estimator considered is the binomial-based method when a high proportion of unaffected features can be assumed and the MDL-based method otherwise. A second simulation study was conducted with additional values of N. We found that HBE requires N to be at least 6-12 features to perform as well as the estimators proposed here, with the precise minimum N depending on p0 and dalt.
{"title":"Estimators of the local false discovery rate designed for small numbers of tests.","authors":"Marta Padilla, David R Bickel","doi":"10.1515/1544-6115.1807","DOIUrl":"https://doi.org/10.1515/1544-6115.1807","url":null,"abstract":"<p><p>Histogram-based empirical Bayes methods developed for analyzing data for large numbers of genes, SNPs, or other biological features tend to have large biases when applied to data with a smaller number of features such as genes with expression measured conventionally, proteins, and metabolites. To analyze such small-scale and medium-scale data in an empirical Bayes framework, we introduce corrections of maximum likelihood estimators (MLEs) of the local false discovery rate (LFDR). In this context, the MLE estimates the LFDR, which is a posterior probability of null hypothesis truth, by estimating the prior distribution. The corrections lie in excluding each feature when estimating one or more parameters on which the prior depends. In addition, we propose the expected LFDR (ELFDR) in order to propagate the uncertainty involved in estimating the prior. We also introduce an optimally weighted combination of the best of the corrected MLEs with a previous estimator that, being based on a binomial distribution, does not require a parametric model of the data distribution across features. An application of the new estimators and previous estimators to protein abundance data illustrates the extent to which different estimators lead to different conclusions about which proteins are affected by cancer. A simulation study was conducted to approximate the bias of the new estimators relative to previous LFDR estimators. Data were simulated for two different numbers of features (N), two different noncentrality parameter values or detectability levels (dalt), and several proportions of unaffected features (p0). One of these previous estimators is a histogram-based estimator (HBE) designed for a large number of features. The simulations show that some of the corrected MLEs and the ELFDR that corrects the HBE reduce the negative bias relative to the MLE and the HBE, respectively. For every method, we defined the worst-case performance as the maximum of the absolute value of the bias over the two different dalt and over various p0. The best worst-case methods represent the safest methods to be used under given conditions. This analysis indicates that the binomial-based method has the lowest worst-case absolute bias for high p0 and for N = 3, 12. However, the corrected MLE that is based on the minimum description length (MDL) principle is the best worst-case method when the value of p0 is more uncertain since it has one of the lowest worst-case biases over all possible values of p0 and for N = 3, 12. Therefore, the safest estimator considered is the binomial-based method when a high proportion of unaffected features can be assumed and the MDL-based method otherwise. A second simulation study was conducted with additional values of N. 
We found that HBE requires N to be at least 6-12 features to perform as well as the estimators proposed here, with the precise minimum N depending on p0 and dalt.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":"4"},"PeriodicalIF":0.9,"publicationDate":"2012-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1807","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30988559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
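A minimal sketch of the corrected-MLE idea under a deliberately simplified model: z-statistics follow p0*N(0,1) + (1-p0)*N(dalt,1), the prior parameters are re-estimated by maximum likelihood with the focal feature left out, and the LFDR is the resulting posterior null probability. The estimators in the paper (including the MDL-based correction and the ELFDR) are more involved.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(4)
z = np.concatenate([rng.normal(0, 1, 8), rng.normal(3, 1, 4)])   # 12 features

def neg_loglik(theta, zs):
    p0, dalt = theta
    dens = p0 * stats.norm.pdf(zs, 0, 1) + (1 - p0) * stats.norm.pdf(zs, dalt, 1)
    return -np.sum(np.log(dens))

def corrected_mle_lfdr(z):
    """Leave-one-out ('corrected') MLE of the LFDR for each feature."""
    lfdr = np.empty_like(z)
    for i in range(len(z)):
        others = np.delete(z, i)                 # exclude the focal feature
        res = optimize.minimize(neg_loglik, x0=[0.8, 2.0], args=(others,),
                                bounds=[(0.01, 0.99), (0.1, 10.0)])
        p0, dalt = res.x
        f0 = p0 * stats.norm.pdf(z[i], 0, 1)
        f1 = (1 - p0) * stats.norm.pdf(z[i], dalt, 1)
        lfdr[i] = f0 / (f0 + f1)
    return lfdr

print(np.round(corrected_mle_lfdr(z), 3))
```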
Copy number variations (CNVs) are important in disease association studies and are targeted by most recent microarray platforms developed for GWAS. However, the probes targeting the same CNV region can vary greatly in performance, with some probes carrying little more information than pure noise. In this paper, we investigate how to best combine measurements from multiple probes to estimate the copy numbers of individuals under the framework of the Gaussian mixture model (GMM). First, we show that under two regularity conditions, and assuming all parameters except the mixing proportions are known, optimal weights can be obtained so that the univariate GMM based on the weighted average gives exactly the same classification as the multivariate GMM. We then develop an algorithm that iteratively estimates the parameters, obtains the optimal weights, and uses them for classification. The algorithm performs well on simulated data and on two real data sets, showing a clear advantage over classification based on the equally weighted average.
{"title":"Genotype copy number variations using Gaussian mixture models: theory and algorithms.","authors":"Chang-Yun Lin, Yungtai Lo, Kenny Q Ye","doi":"10.1515/1544-6115.1725","DOIUrl":"https://doi.org/10.1515/1544-6115.1725","url":null,"abstract":"<p><p>Copy number variations (CNVs) are important in the disease association studies and are usually targeted by most recent microarray platforms developed for GWAS studies. However, the probes targeting the same CNV regions could vary greatly in performance, with some of the probes carrying little information more than pure noise. In this paper, we investigate how to best combine measurements of multiple probes to estimate copy numbers of individuals under the framework of Gaussian mixture model (GMM). First we show that under two regularity conditions and assume all the parameters except the mixing proportions are known, optimal weights can be obtained so that the univariate GMM based on the weighted average gives the exactly the same classification as the multivariate GMM does. We then developed an algorithm that iteratively estimates the parameters and obtains the optimal weights, and uses them for classification. The algorithm performs well on simulation data and two sets of real data, which shows clear advantage over classification based on the equal weighted average.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 5","pages":"5"},"PeriodicalIF":0.9,"publicationDate":"2012-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1725","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30988558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}