Statistical Applications in Genetics and Molecular Biology: Latest Articles (Page 5)

Distinct characteristics of correlation analysis at the single-cell and the population level

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2020-08-19 DOI: 10.21203/rs.3.rs-42825/v1

Guoyu Wu, Yuchao Li

Abstract: Correlation analysis is widely used in biological studies to infer molecular relationships within biological networks. Recently, single-cell analysis has drawn tremendous interest for its ability to obtain high-resolution molecular phenotypes. It turns out that co-expressed genes identified in single-cell-level investigations overlap little with those identified in population-level investigations. However, the relationship between correlations at the single-cell and population levels remains unclear. In this manuscript, we aimed to unveil the origin of the differences between correlation coefficients at the single-cell level and those at the population level, and to bridge the gap between them. By developing formulations that link correlations at the single-cell and population levels, we showed that aggregated correlations can be stronger than, weaker than, or equal to the corresponding individual correlations, depending on the variations and the correlations within the population. When the correlation within the population is weaker than the individual correlation, the aggregated correlation is stronger than the corresponding individual correlation. In addition, our data indicated that the aggregated correlation is more likely to be stronger than the corresponding individual correlation, and gene pairs that were strongly correlated exclusively at the single-cell level were rare. Using a bottom-up approach to model interactions between molecules in a signaling cascade or in multi-regulator-controlled gene expression, we found, surprisingly, that an interaction between two components cannot be excluded simply because their correlation coefficient is low, which suggests reconsidering network connectivity derived solely from correlation analysis. We also investigated the impact of random technical measurement errors on correlation coefficients at the single-cell and population levels. The results indicate that the aggregated correlation is relatively robust and less affected. Because of the heterogeneity among single cells, correlation coefficients calculated from single-cell data may differ from those calculated from population-level data. Depending on the specific question being asked, proper sampling and normalization procedures should be applied before any conclusions are drawn.
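The contrast between individual and aggregated correlations can be illustrated with a small simulation. The sketch below is not the authors' formulation: it assumes a hypothetical design in which each population has a baseline expression level for two genes (correlated across populations with `rho_between`) and each cell deviates from that baseline (correlated within a population with `rho_within`). All function names and parameter values are illustrative.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlated_pair(rng, rho):
    """Draw (a, b), each standard normal, with correlation rho."""
    a = rng.gauss(0.0, 1.0)
    b = rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    return a, b


def simulate(n_pops=200, n_cells=50, rho_between=0.9, rho_within=0.0, seed=0):
    """Return (single-cell correlation, correlation of population averages)."""
    rng = random.Random(seed)
    cells_x, cells_y, means_x, means_y = [], [], [], []
    for _ in range(n_pops):
        # Population-level baseline for the two genes (assumed design).
        base_x, base_y = correlated_pair(rng, rho_between)
        xs, ys = [], []
        for _ in range(n_cells):
            # Cell-specific deviation from the population baseline.
            dx, dy = correlated_pair(rng, rho_within)
            xs.append(base_x + dx)
            ys.append(base_y + dy)
        cells_x += xs
        cells_y += ys
        means_x.append(sum(xs) / n_cells)
        means_y.append(sum(ys) / n_cells)
    return pearson(cells_x, cells_y), pearson(means_x, means_y)


individual_r, aggregated_r = simulate()
print(f"single-cell r = {individual_r:.2f}, aggregated r = {aggregated_r:.2f}")
```

With a weak within-population correlation, averaging removes most of the cell-to-cell noise, so the aggregated coefficient comes out larger than the single-cell one. Swapping the two settings (`rho_between=0.0, rho_within=0.9`) gives the opposite ordering in this toy design.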

Citations: 0

Accuracy and sensitivity of different Bayesian methods for genomic prediction using simulation and real data.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2020-08-10 DOI: 10.1515/sagmb-2019-0007

Saheb Foroutaifar

The main objectives of this study were to compare the prediction accuracy of different Bayesian methods for traits with a wide range of genetic architectures, using simulated and real data, and to assess the sensitivity of these methods to violations of their assumptions. For the simulation study, different scenarios were implemented based on two traits with low or high heritability, different numbers of QTLs, and different distributions of QTL effects. For the real data analysis, a German Holstein dataset for milk fat percentage, milk yield, and somatic cell score was used. The simulation results showed that, with the exception of Bayes R, the methods were sensitive to changes in the number of QTLs and the distribution of QTL effects. Having a distribution of QTL effects similar to what a given Bayesian method assumes for estimating marker effects did not improve its prediction accuracy. The Bayes B method gave accuracy higher than or equal to that of the other methods. The real data analysis showed that, as in the simulation scenarios with a large number of QTLs, there was no difference between the accuracies of the different methods for any of the traits.

Citations: 1

Understanding hormonal crosstalk in Arabidopsis root development via emulation and history matching.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2020-07-13 DOI: 10.1515/sagmb-2018-0053

Samuel E Jackson, Ian Vernon, Junli Liu, Keith Lindsey

A major challenge in plant developmental biology is to understand how plant growth is coordinated by interacting hormones and genes. To meet this challenge, it is important not only to use experimental data but also to formulate a mathematical model. For the mathematical model to best describe the true biological system, it is necessary to understand the parameter space of the model, along with the links between the model, the parameter space and experimental observations. We develop sequential history matching methodology, using Bayesian emulation, to gain substantial insight into biological model parameter spaces. This is achieved by finding sets of acceptable parameters in accordance with successive sets of physical observations. These methods are then applied to a complex hormonal crosstalk model for Arabidopsis root growth. In this application, we demonstrate how an initial set of 22 observed trends reduces the volume of the set of acceptable inputs to a proportion of 6.1 × 10⁻⁷ of the original space. Two additional sets of biologically relevant experimental data, each of size 5, reduce the size of this space by a further three and two orders of magnitude, respectively. Hence, we provide insight into the constraints placed upon the model structure by, and the biological consequences of, measuring subsets of observations.
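The idea of ruling out parameter settings in successive waves can be sketched in a few lines. The abstract does not state the criterion used, so the example below assumes the implausibility measure and cutoff of 3 that are common in the history matching literature. The `toy_model`, the observation values, and their variances are all made up for illustration. Because the toy model is cheap it is evaluated directly, which stands in for an emulator with zero predictive variance.

```python
import math
import random


def toy_model(x1, x2):
    """Hypothetical two-output stand-in for an expensive simulator."""
    return (x1 + x2, x1 * x2)


def implausibility(prediction, prediction_var, observed, observed_var,
                   discrepancy_var=0.0):
    """Standardised distance between a model prediction and an observation."""
    total_var = prediction_var + observed_var + discrepancy_var
    return abs(observed - prediction) / math.sqrt(total_var)


def non_implausible_fraction(observations, n_samples=20000, cutoff=3.0, seed=1):
    """Share of uniformly sampled inputs that no observation rules out.

    observations: list of (output_index, observed_value, observed_variance).
    """
    rng = random.Random(seed)
    kept = 0
    for _ in range(n_samples):
        outputs = toy_model(rng.random(), rng.random())
        # Direct evaluation: the prediction variance is set to zero here.
        worst = max(
            implausibility(outputs[i], 0.0, z, var) for i, z, var in observations
        )
        if worst <= cutoff:
            kept += 1
    return kept / n_samples


wave_1 = [(0, 1.0, 0.01 ** 2)]             # first batch of made-up observations
wave_2 = wave_1 + [(1, 0.21, 0.005 ** 2)]  # a later batch adds a constraint

fraction_1 = non_implausible_fraction(wave_1)
fraction_2 = non_implausible_fraction(wave_2)
print(f"after wave 1: {fraction_1:.4f}, after wave 2: {fraction_2:.4f}")

# Arithmetic from the abstract: 6.1e-7, then about 3 and 2 more orders of magnitude.
reported = 6.1e-7
after_all = reported * 1e-3 * 1e-2
print(f"approximate final proportion: {after_all:.1e}")
```

Each added batch of observations can only shrink the non-implausible set, which is why the reported proportions multiply: roughly 6.1 × 10⁻⁷ × 10⁻³ × 10⁻², or on the order of 10⁻¹² of the original space. The orders of magnitude in the abstract are approximate, so this product is only indicative.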

Citations: 4

Bivariate traits association analysis using generalized estimating equations in family data.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2020-05-05 DOI: 10.1515/sagmb-2019-0030

Mariza de Andrade, Mauricio A Mazo Lopera, Nubia E Duarte

Genome-wide association studies (GWAS) are becoming fundamental in the arduous task of deciphering the etiology of complex diseases. The majority of the statistical models used to address gene-disease association consider a single response variable. However, it is common for certain diseases, such as cardiovascular diseases, to have correlated phenotypes. GWAS typically sample unrelated individuals from a population, and shared familial risk factors are not investigated. In this paper, we propose a bivariate model for family data that associates two phenotypes with a genetic region. Using generalized estimating equations (GEE), we model two phenotypes, which may be discrete, continuous or a mixture of both, as a function of genetic variables and other important covariates. We incorporate the kinship relationships into the working correlation matrix, extended to a bivariate analysis. The estimation method and the joint gene-set effect on both phenotypes are developed in this work. We also evaluate the proposed methodology with a simulation study and an application to real data.

Citations: 0

Bayesian approach to discriminant problems for count data with application to multilocus short tandem repeat dataset.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2020-05-04 DOI: 10.1515/sagmb-2018-0044

Koji Tsukuda, Shuhei Mano, Toshimichi Yamamoto

Short Tandem Repeats (STRs) are a type of DNA polymorphism. This study considers discriminant analysis to determine the population of test individuals using an STR database containing the lengths of STRs observed at more than one locus. The discriminant method based on the Bayes factor is discussed and an improved method is proposed. The main issues are to develop a method that is relatively robust to sample size imbalance, identify a procedure to select loci, and treat the parameter in the prior distribution. A previous study achieved a classification accuracy of 0.748 for the g-mean (geometric mean of classification accuracies for two populations) and 0.867 for the AUC (area under the receiver operating characteristic curve). We improve the maximum values for the g-mean to 0.830 and the AUC to 0.935. Computer simulations indicate that the previous method is susceptible to sample size imbalance, whereas the proposed method is more robust while achieving almost identical classification accuracy. Furthermore, the results confirm that threshold adjustment is an effective countermeasure to sample size imbalance.
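The two evaluation measures named in the abstract, and the effect of threshold adjustment under class imbalance, can be made concrete with a short sketch. The labels and scores below are invented for illustration and are not the STR data. Treating the score as something like a log Bayes factor compared against a threshold is an assumption, and the AUC is computed with the standard pairwise (rank-based) definition.

```python
import math


def g_mean(labels, predictions):
    """Geometric mean of the per-class accuracies for classes 0 and 1."""
    accuracies = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        accuracies.append(sum(predictions[i] == cls for i in idx) / len(idx))
    return math.sqrt(accuracies[0] * accuracies[1])


def auc(labels, scores):
    """Probability that a class-1 score exceeds a class-0 score (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classify(scores, threshold):
    """Assign class 1 when the score exceeds the threshold, else class 0."""
    return [1 if s > threshold else 0 for s in scores]


# Made-up, imbalanced example: 2 individuals from class 1, 8 from class 0.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.4, -0.2, -0.1, -0.5, -0.8, -1.0, -1.2, -1.5, -2.0, -0.3]

default_g = g_mean(labels, classify(scores, threshold=0.0))
adjusted_g = g_mean(labels, classify(scores, threshold=-0.25))
print(f"AUC = {auc(labels, scores):.4f}")
print(f"g-mean at threshold 0.00:  {default_g:.3f}")
print(f"g-mean at threshold -0.25: {adjusted_g:.3f}")
```

In this toy example the AUC does not depend on the threshold, while the g-mean improves when the threshold is shifted toward the minority class. This is the kind of effect that threshold adjustment is meant to capture.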

Citations: 0

Identification of supervised and sparse functional genomic pathways.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2020-02-29 DOI: 10.1515/sagmb-2018-0026

Fan Zhang, Jeffrey C Miecznikowski, David L Tritchler

Functional pathways involve a series of biological alterations that may result in the occurrence of many diseases, including cancer. With the availability of various "omics" technologies, it becomes feasible to integrate information from a hierarchy of biological layers to provide a more comprehensive understanding of the disease. In many diseases, it is believed that only a small number of networks, each relatively small in size, drive the disease. Our goal in this study is to develop methods to discover these functional networks across biological layers that are correlated with the phenotype. We derive a novel Network Summary Matrix (NSM) that highlights potential pathways conforming to least squares regression relationships. We propose an algorithm called Decomposition of Network Summary Matrix via Instability (DNSMI), which decomposes the NSM using instability regularization. Simulations and an analysis of real data from The Cancer Genome Atlas (TCGA) program demonstrate the performance of the algorithm.

Citations: 4

Joint variable selection and network modeling for detecting eQTLs.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2020-02-20 DOI: 10.1515/sagmb-2019-0032

Xuan Cao, Lili Ding, Tesfaye B Mersha

In this study, we compare the three most recent statistical methods for joint variable selection and covariance estimation, applied to detecting expression quantitative trait loci (eQTL) and estimating gene networks, and introduce a new hierarchical Bayesian method to be included in the comparison. Unlike the traditional univariate regression approach in eQTL analysis, all four methods relate phenotypes to genotypes through multivariate regression models that incorporate the dependence among phenotypes, and use Bayesian multiplicity adjustment to avoid the burden imposed by traditional multiple testing correction methods. We present the performance of three existing methods (MSSL - Multivariate Spike and Slab Lasso, SSUR - Sparse Seemingly Unrelated Bayesian Regression, and OBFBF - Objective Bayes Fractional Bayes Factor), along with the proposed JDAG (Joint estimation via a Gaussian Directed Acyclic Graph model) method, through simulation experiments and publicly available HapMap real data, taking asthma as an example. Compared with existing methods, JDAG identified networks with higher sensitivity and specificity under row-wise sparse settings. JDAG requires less execution time in small-to-moderate dimensions, but is not currently applicable to high-dimensional data. The eQTL analysis of the asthma data showed a number of known gene regulations, such as STARD3, IKZF3 and PGAP3, all reported in asthma studies. The code of the proposed method is freely available on GitHub (https://github.com/xuan-cao/Joint-estimation-for-eQTL).

Citations: 0

An extended model for phylogenetic maximum likelihood based on discrete morphological characters.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2020-02-20 DOI: 10.1515/sagmb-2019-0029

David A Spade

Maximum likelihood is a common method of estimating a phylogenetic tree based on a set of genetic data. However, models of evolution for certain types of genetic data are highly flawed in their specification, and this misspecification can have an adverse impact on phylogenetic inference. Our attention here is focused on extending an existing class of models for estimating phylogenetic trees from discrete morphological characters. The main advance of this work is a model that allows unequal equilibrium frequencies in the estimation of phylogenetic trees from discrete morphological character data using likelihood methods. Possible extensions of the proposed model will also be discussed.
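To make "unequal equilibrium frequencies" concrete, the sketch below uses the simplest possible case: a reversible two-state Markov model for a binary character. It is not the model proposed in the paper, whose details the abstract does not give. The function names, the rate parameter `mu`, and the example frequencies are illustrative assumptions. For two states, the transition probability has the closed form P_ij(t) = π_j + (δ_ij − π_j)·exp(−μt).

```python
import math


def transition_matrix(pi0, t, mu=1.0):
    """Two-state transition probabilities over a branch of length t.

    pi0 is the equilibrium frequency of state 0; state 1 has 1 - pi0.
    """
    pi = (pi0, 1.0 - pi0)
    decay = math.exp(-mu * t)
    return [
        [pi[j] + ((1.0 if i == j else 0.0) - pi[j]) * decay for j in range(2)]
        for i in range(2)
    ]


def pair_likelihood(state_a, state_b, pi0, t):
    """Likelihood of the states at the two ends of one branch of length t."""
    pi = (pi0, 1.0 - pi0)
    return pi[state_a] * transition_matrix(pi0, t)[state_a][state_b]


# Equal frequencies (pi0 = 0.5) versus unequal frequencies (pi0 = 0.8).
for pi0 in (0.5, 0.8):
    p = transition_matrix(pi0, t=0.5)
    print(f"pi0 = {pi0}: P(0->1) = {p[0][1]:.3f}, P(1->0) = {p[1][0]:.3f}")
```

With equal frequencies the two directions of change are equally likely. With `pi0 = 0.8`, changes into the common state are more probable than changes out of it, yet the model stays reversible, so the likelihood of a pair of tips does not depend on which tip is treated as the start of the branch.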

Citations: 1

Sparse latent factor regression models for genome-wide and epigenome-wide association studies

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2020-02-07 DOI: 10.1101/2020.02.07.938381

B. Jumentier, Kévin Caye, B. Heude, J. Lepeule, O. François

Abstract: Association of phenotypes or exposures with genomic and epigenomic data faces important statistical challenges. One of these challenges is to account for variation due to unobserved confounding factors, such as individual ancestry or cell-type composition in tissues. This issue can be addressed with penalized latent factor regression models, where penalties are introduced to cope with the high dimensionality of the data. If a relatively small proportion of genomic or epigenomic markers correlate with the variable of interest, sparsity penalties may help to capture the relevant associations, but the improvement over non-sparse approaches has not yet been fully evaluated. Here, we present least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. In simulated data, sparse latent factor regression models generally achieved higher statistical performance than other sparse methods, including the least absolute shrinkage and selection operator and a Bayesian sparse linear mixed model. In generative model simulations, statistical performance was slightly lower than (while being comparable to) that of non-sparse methods, but in simulations based on empirical data, sparse latent factor regression models were more robust to departures from the model than the non-sparse approaches. We applied sparse latent factor regression models to a genome-wide association study of a flowering trait for the plant Arabidopsis thaliana and to an epigenome-wide association study of smoking status in pregnant women. For both applications, sparse latent factor regression models facilitated the estimation of non-null effect sizes while overcoming multiple testing issues. The results were not only consistent with previous discoveries, but also pinpointed new genes with functional annotations relevant to each application.
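A latent factor regression model of the kind described can be written compactly. The notation below is an assumption chosen for illustration, since the abstract states neither the model nor the objective. Y is the n × p matrix of genomic or epigenomic markers, X the n × d matrix of phenotypes or exposures, B the p × d matrix of effect sizes, U the n × K matrix of latent (confounding) factors, V the p × K matrix of loadings, and E the residual noise. An L1 penalty on B is one common way to encode sparsity, and a full objective may also regularise the latent term.

```latex
\mathbf{Y} = \mathbf{X}\mathbf{B}^{\top} + \mathbf{U}\mathbf{V}^{\top} + \mathbf{E},
\qquad
\min_{\mathbf{B},\,\mathbf{U},\,\mathbf{V}}\;
\bigl\lVert \mathbf{Y} - \mathbf{X}\mathbf{B}^{\top} - \mathbf{U}\mathbf{V}^{\top} \bigr\rVert_{F}^{2}
+ \lambda \,\lVert \mathbf{B} \rVert_{1}
```

Under this reading, markers whose rows of B are shrunk exactly to zero are treated as unassociated with the variable of interest, which is how a sparsity penalty can sidestep a separate multiple testing step.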



Citations: 3

Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics.

IF 0.9 Zone 4 (Mathematics) Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-12-12 DOI: 10.1515/sagmb-2018-0065

Oliver M Crook, Laurent Gatto, Paul D W Kirk

The Dirichlet Process (DP) mixture model has become a popular choice for model-based clustering, largely because it allows the number of clusters to be inferred. The sequential updating and greedy search (SUGS) algorithm (Wang & Dunson, 2011) was proposed as a fast method for performing approximate Bayesian inference in DP mixture models, by posing clustering as a Bayesian model selection (BMS) problem and avoiding the use of computationally costly Markov chain Monte Carlo methods. Here we consider how this approach may be extended to permit variable selection for clustering, and also demonstrate the benefits of Bayesian model averaging (BMA) in place of BMS. Through an array of simulation examples and well-studied examples from cancer transcriptomics, we show that our method performs competitively with the current state-of-the-art, while also offering computational benefits. We apply our approach to reverse-phase protein array (RPPA) data from The Cancer Genome Atlas (TCGA) in order to perform a pan-cancer proteomic characterisation of 5157 tumour samples. We have implemented our approach, together with the original SUGS algorithm, in an open-source R package named sugsvarsel, which accelerates analysis by performing intensive computations in C++ and provides automated parallel processing. The R package is freely available from: https://github.com/ococrook/sugsvarsel.
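The sequential, greedy allocation idea mentioned in this abstract can be sketched in a few lines of Python. This is a heavily simplified illustration and not the sugsvarsel implementation. It assumes one-dimensional data, a Gaussian likelihood with known variance, a conjugate normal prior on each cluster mean, and a fixed concentration parameter `alpha`, and it omits variable selection and model averaging entirely. The function names and default values are hypothetical.

```python
import math


def log_normal_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def greedy_sequential_clustering(data, alpha=1.0, noise_var=1.0,
                                 prior_mean=0.0, prior_var=100.0):
    """Assign each point, in order, to its most probable cluster given the
    allocations made so far; a new cluster is opened when that is most probable."""
    labels = []
    counts = []  # number of points in each cluster
    sums = []    # sum of the points in each cluster
    for x in data:
        scores = []
        for n_k, s_k in zip(counts, sums):
            # Conjugate posterior of the cluster mean, then the predictive density of x.
            post_var = 1.0 / (1.0 / prior_var + n_k / noise_var)
            post_mean = post_var * (prior_mean / prior_var + s_k / noise_var)
            scores.append(math.log(n_k)
                          + log_normal_pdf(x, post_mean, post_var + noise_var))
        # Last option: open a new cluster, weighted by alpha under the DP prior.
        scores.append(math.log(alpha)
                      + log_normal_pdf(x, prior_mean, prior_var + noise_var))
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == len(counts):
            counts.append(0)
            sums.append(0.0)
        counts[best] += 1
        sums[best] += x
        labels.append(best)
    return labels


if __name__ == "__main__":
    print(greedy_sequential_clustering([0.1, -0.2, 0.0, 10.1, 9.8, 10.0, 0.05, 9.9]))
```

Each point is visited once and allocations are never revisited, which is what makes this kind of scheme fast compared with Markov chain Monte Carlo. The price is that the result can depend on the order in which the data are presented.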



Citations: 0