Statistical Applications in Genetics and Molecular Biology最新文献

When is the allele-sharing dissimilarity between two populations exceeded by the allele-sharing dissimilarity of a population with itself? 当两个种群之间的等位基因共享相似性超过一个种群与自身的等位基因共享相似性时？

IF 0.9 4区数学 Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2023-12-11 DOI: 10.1515/sagmb-2023-0004

Xiran Liu, Zarif Ahsan, Tarun K. Martheswaran, Noah A. Rosenberg

Allele-sharing statistics for a genetic locus measure the dissimilarity between two populations as a mean of the dissimilarity between random pairs of individuals, one from each population. Owing to within-population variation in genotype, allele-sharing dissimilarities can have the property that they have a nonzero value when computed between a population and itself. We consider the mathematical properties of allele-sharing dissimilarities in a pair of populations, treating the allele frequencies in the two populations parametrically. Examining two formulations of allele-sharing dissimilarity, we obtain the distributions of within-population and between-population dissimilarities for pairs of individuals. We then mathematically explore the scenarios in which, for certain allele-frequency distributions, the within-population dissimilarity – the mean dissimilarity between randomly chosen members of a population – can exceed the dissimilarity between two populations. Such scenarios assist in explaining observations in population-genetic data that members of a population can be empirically more genetically dissimilar from each other on average than they are from members of another population. For a population pair, however, the mathematical analysis finds that at least one of the two populations always possesses smaller within-population dissimilarity than the value of the between-population dissimilarity. We illustrate the mathematical results with an application to human population-genetic data.

遗传位点的等位基因共享统计量是将两个种群之间的差异作为随机配对个体之间差异的平均值来衡量的，每个种群中随机配对一个个体。由于种群内基因型的差异，等位基因共享相似度在计算种群与种群之间的相似度时，可能具有非零值的特性。我们考虑了一对种群中等位基因共享异质性的数学特性，并对两个种群中的等位基因频率进行了参数化处理。通过研究等位基因共享相似性的两种公式，我们得到了一对个体的种群内和种群间相似性的分布。然后，我们用数学方法探讨了在某些等位基因频率分布的情况下，种群内相似度（随机选择的种群成员之间的平均相似度）可能超过两个种群之间相似度的情形。这种情况有助于解释种群遗传数据中的观察结果，即一个种群成员之间的平均遗传差异可能大于另一个种群成员之间的平均差异。然而，数学分析发现，对于一对种群来说，两个种群中至少有一个种群的种群内相似性总是小于种群间相似性的值。我们将应用人类种群遗传数据来说明数学结果。

{"title":"When is the allele-sharing dissimilarity between two populations exceeded by the allele-sharing dissimilarity of a population with itself?","authors":"Xiran Liu, Zarif Ahsan, Tarun K. Martheswaran, Noah A. Rosenberg","doi":"10.1515/sagmb-2023-0004","DOIUrl":"https://doi.org/10.1515/sagmb-2023-0004","url":null,"abstract":"Allele-sharing statistics for a genetic locus measure the dissimilarity between two populations as a mean of the dissimilarity between random pairs of individuals, one from each population. Owing to within-population variation in genotype, allele-sharing dissimilarities can have the property that they have a nonzero value when computed between a population and itself. We consider the mathematical properties of allele-sharing dissimilarities in a pair of populations, treating the allele frequencies in the two populations parametrically. Examining two formulations of allele-sharing dissimilarity, we obtain the distributions of within-population and between-population dissimilarities for pairs of individuals. We then mathematically explore the scenarios in which, for certain allele-frequency distributions, the within-population dissimilarity – the mean dissimilarity between randomly chosen members of a population – can exceed the dissimilarity between two populations. Such scenarios assist in explaining observations in population-genetic data that members of a population can be empirically more genetically dissimilar from each other on average than they are from members of another population. For a population pair, however, the mathematical analysis finds that at least one of the two populations always possesses smaller within-population dissimilarity than the value of the between-population dissimilarity. We illustrate the mathematical results with an application to human population-genetic data.","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"18 1","pages":""},"PeriodicalIF":0.9,"publicationDate":"2023-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138572820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Sparse latent factor regression models for genome-wide and epigenome-wide association studies 全基因组和表观全基因组关联研究的稀疏潜在因子回归模型

IF 0.9 4区数学 Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2022-01-01 DOI: 10.1515/sagmb-2021-0035

Basile Jumentier, Kevin Caye, Barbara Heude, Johanna Lepeule, Olivier François

Association of phenotypes or exposures with genomic and epigenomic data faces important statistical challenges. One of these challenges is to account for variation due to unobserved confounding factors, such as individual ancestry or cell-type composition in tissues. This issue can be addressed with penalized latent factor regression models, where penalties are introduced to cope with high dimension in the data. If a relatively small proportion of genomic or epigenomic markers correlate with the variable of interest, sparsity penalties may help to capture the relevant associations, but the improvement over non-sparse approaches has not been fully evaluated yet. Here, we present least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. In simulated data, sparse latent factor regression models generally achieved higher statistical performance than other sparse methods, including the least absolute shrinkage and selection operator and a Bayesian sparse linear mixed model. In generative model simulations, statistical performance was slightly lower (while being comparable) to non-sparse methods, but in simulations based on empirical data, sparse latent factor regression models were more robust to departure from the model than the non-sparse approaches. We applied sparse latent factor regression models to a genome-wide association study of a flowering trait for the plant Arabidopsis thaliana and to an epigenome-wide association study of smoking status in pregnant women. For both applications, sparse latent factor regression models facilitated the estimation of non-null effect sizes while overcoming multiple testing issues. The results were not only consistent with previous discoveries, but they also pinpointed new genes with functional annotations relevant to each application.

表型或暴露与基因组和表观基因组数据的关联面临着重要的统计挑战。其中一个挑战是解释由于未观察到的混杂因素引起的变异，例如个体祖先或组织中的细胞类型组成。这个问题可以通过惩罚潜在因素回归模型来解决，其中引入惩罚来处理数据中的高维。如果相对较小比例的基因组或表观基因组标记与感兴趣的变量相关，稀疏度惩罚可能有助于捕获相关关联，但非稀疏方法的改进尚未得到充分评估。在这里，我们提出了最小二乘算法，联合估计稀疏潜在因素回归模型中的效应大小和混杂因素。在模拟数据中，稀疏潜因子回归模型通常比其他稀疏方法具有更高的统计性能，包括最小绝对收缩和选择算子以及贝叶斯稀疏线性混合模型。在生成模型模拟中，统计性能略低于非稀疏方法(但与之相当)，但在基于经验数据的模拟中，稀疏潜在因素回归模型比非稀疏方法对偏离模型的鲁棒性更强。我们将稀疏潜在因子回归模型应用于拟南芥开花性状的全基因组关联研究和孕妇吸烟状况的全基因组关联研究。对于这两种应用，稀疏潜在因素回归模型有助于估计非零效应大小，同时克服了多个测试问题。结果不仅与先前的发现一致，而且他们还确定了与每种应用相关的功能注释的新基因。

{"title":"Sparse latent factor regression models for genome-wide and epigenome-wide association studies","authors":"Basile Jumentier, Kevin Caye, Barbara Heude, Johanna Lepeule, Olivier François","doi":"10.1515/sagmb-2021-0035","DOIUrl":"https://doi.org/10.1515/sagmb-2021-0035","url":null,"abstract":"Association of phenotypes or exposures with genomic and epigenomic data faces important statistical challenges. One of these challenges is to account for variation due to unobserved confounding factors, such as individual ancestry or cell-type composition in tissues. This issue can be addressed with penalized latent factor regression models, where penalties are introduced to cope with high dimension in the data. If a relatively small proportion of genomic or epigenomic markers correlate with the variable of interest, sparsity penalties may help to capture the relevant associations, but the improvement over non-sparse approaches has not been fully evaluated yet. Here, we present least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. In simulated data, sparse latent factor regression models generally achieved higher statistical performance than other sparse methods, including the least absolute shrinkage and selection operator and a Bayesian sparse linear mixed model. In generative model simulations, statistical performance was slightly lower (while being comparable) to non-sparse methods, but in simulations based on empirical data, sparse latent factor regression models were more robust to departure from the model than the non-sparse approaches. We applied sparse latent factor regression models to a genome-wide association study of a flowering trait for the plant Arabidopsis thaliana and to an epigenome-wide association study of smoking status in pregnant women. For both applications, sparse latent factor regression models facilitated the estimation of non-null effect sizes while overcoming multiple testing issues. The results were not only consistent with previous discoveries, but they also pinpointed new genes with functional annotations relevant to each application.","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"4 1","pages":""},"PeriodicalIF":0.9,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138528299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Low variability in the underlying cellular landscape adversely affects the performance of interaction-based approaches for conducting cell-specific analyses of DNA methylation in bulk samples. 潜在细胞景观的低变异性对基于相互作用的方法在大量样本中进行DNA甲基化的细胞特异性分析的性能产生了不利影响。

IF 0.9 4区数学 Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2021-08-10 DOI: 10.1515/sagmb-2021-0004

Richard Meier, Emily Nissen, Devin C Koestler

Statistical methods that allow for cell type specific DNA methylation (DNAm) analyses based on bulk-tissue methylation data have great potential to improve our understanding of human disease and have created unprecedented opportunities for new insights using the wealth of publicly available bulk-tissue methylation data. These methodologies involve incorporating interaction terms formed between the phenotypes/exposures of interest and proportions of the cell types underlying the bulk-tissue sample used for DNAm profiling. Despite growing interest in such "interaction-based" methods, there has been no comprehensive assessment how variability in the cellular landscape across study samples affects their performance. To answer this question, we used numerous publicly available whole-blood DNAm data sets along with extensive simulation studies and evaluated the performance of interaction-based approaches in detecting cell-specific methylation effects. Our results show that low cell proportion variability results in large estimation error and low statistical power for detecting cell-specific effects of DNAm. Further, we identified that many studies targeting methylation profiling in whole-blood may be at risk to be underpowered due to low variability in the cellular landscape across study samples. Finally, we discuss guidelines for researchers seeking to conduct studies utilizing interaction-based approaches to help ensure that their studies are adequately powered.

基于大量组织甲基化数据进行细胞类型特异性DNA甲基化（DNAm）分析的统计方法在提高我们对人类疾病的理解方面具有巨大潜力，并利用丰富的公开组织甲基化信息为新的见解创造了前所未有的机会。这些方法涉及结合在感兴趣的表型/暴露与用于DNAm分析的大块组织样品下的细胞类型比例之间形成的相互作用项。尽管人们对这种“基于交互”的方法越来越感兴趣，但还没有全面评估研究样本中细胞景观的可变性如何影响其性能。为了回答这个问题，我们使用了大量公开的全血DNAm数据集以及广泛的模拟研究，并评估了基于相互作用的方法在检测细胞特异性甲基化效应方面的性能。我们的结果表明，低细胞比例变异性导致检测DNAm的细胞特异性效应的大的估计误差和低的统计能力。此外，我们发现，由于研究样本中细胞景观的低变异性，许多针对全血甲基化谱的研究可能有动力不足的风险。最后，我们讨论了寻求利用基于互动的方法进行研究的研究人员的指导方针，以帮助确保他们的研究得到充分的支持。

{"title":"Low variability in the underlying cellular landscape adversely affects the performance of interaction-based approaches for conducting cell-specific analyses of DNA methylation in bulk samples.","authors":"Richard Meier, Emily Nissen, Devin C Koestler","doi":"10.1515/sagmb-2021-0004","DOIUrl":"10.1515/sagmb-2021-0004","url":null,"abstract":"Statistical methods that allow for cell type specific DNA methylation (DNAm) analyses based on bulk-tissue methylation data have great potential to improve our understanding of human disease and have created unprecedented opportunities for new insights using the wealth of publicly available bulk-tissue methylation data. These methodologies involve incorporating interaction terms formed between the phenotypes/exposures of interest and proportions of the cell types underlying the bulk-tissue sample used for DNAm profiling. Despite growing interest in such \"interaction-based\" methods, there has been no comprehensive assessment how variability in the cellular landscape across study samples affects their performance. To answer this question, we used numerous publicly available whole-blood DNAm data sets along with extensive simulation studies and evaluated the performance of interaction-based approaches in detecting cell-specific methylation effects. Our results show that low cell proportion variability results in large estimation error and low statistical power for detecting cell-specific effects of DNAm. Further, we identified that many studies targeting methylation profiling in whole-blood may be at risk to be underpowered due to low variability in the cellular landscape across study samples. Finally, we discuss guidelines for researchers seeking to conduct studies utilizing interaction-based approaches to help ensure that their studies are adequately powered.","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"20 3","pages":"73-84"},"PeriodicalIF":0.9,"publicationDate":"2021-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9125800/pdf/sagmb-20-3-sagmb-2021-0004.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39300900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

AdaReg: data adaptive robust estimation in linear regression with application in GTEx gene expressions. AdaReg:线性回归中数据自适应稳健估计及其在GTEx基因表达中的应用。

IF 0.9 4区数学 Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2021-07-13 DOI: 10.1515/sagmb-2020-0042

Meng Wang, Lihua Jiang, Michael P Snyder

The Genotype-Tissue Expression (GTEx) project provides a valuable resource of large-scale gene expressions across multiple tissue types. Under various technical noise and unknown or unmeasured factors, how to robustly estimate the major tissue effect becomes challenging. Moreover, different genes exhibit heterogeneous expressions across different tissue types. Therefore, we need a robust method which adapts to the heterogeneities of gene expressions to improve the estimation for the tissue effect. We followed the approach of the robust estimation based on γ-density-power-weight in the works of Fujisawa, H. and Eguchi, S. (2008). Robust parameter estimation with a small bias against heavy contamination. J. Multivariate Anal. 99: 2053-2081 and Windham, M.P. (1995). Robustifying model fitting. J. Roy. Stat. Soc. B: 599-609, where γ is the exponent of density weight which controls the balance between bias and variance. As far as we know, our work is the first to propose a procedure to tune the parameter γ to balance the bias-variance trade-off under the mixture models. We constructed a robust likelihood criterion based on weighted densities in the mixture model of Gaussian population distribution mixed with unknown outlier distribution, and developed a data-adaptive γ-selection procedure embedded into the robust estimation. We provided a heuristic analysis on the selection criterion and found that our practical selection trend under various γ's in average performance has similar capability to capture minimizer γ as the inestimable mean squared error (MSE) trend from our simulation studies under a series of settings. Our data-adaptive robustifying procedure in the linear regression problem (AdaReg) showed a significant advantage in both simulation studies and real data application in estimating tissue effect of heart samples from the GTEx project, compared to the fixed γ procedure and other robust methods. At the end, the paper discussed some limitations on this method and future work.

基因型-组织表达(GTEx)项目提供了跨多种组织类型的大规模基因表达的宝贵资源。在各种技术噪声和未知或不可测因素的影响下，如何稳健地估计主要组织效应成为一个挑战。此外，不同的基因在不同的组织类型中表现出异质表达。因此，我们需要一种适应基因表达异质性的鲁棒性方法来提高对组织效应的估计。我们采用了Fujisawa, H.和Eguchi, S.(2008)的基于γ-密度-功率权值的稳健估计方法。对重污染具有小偏差的鲁棒参数估计。[j] .地理科学与管理，1999(1):1 - 3。鲁棒模型拟合。j·罗伊。统计,Soc。B: 599-609，其中γ是控制偏差和方差之间平衡的密度权重的指数。据我们所知，我们的工作是第一个提出一个过程来调整参数γ，以平衡混合模型下的偏差-方差权衡。在高斯总体分布与未知离群分布混合的混合模型中，构建了基于加权密度的稳健似然准则，并开发了嵌入稳健估计的数据自适应γ选择程序。我们对选择准则进行了启发式分析，发现我们在各种平均性能γ下的实际选择趋势与我们在一系列设置下的模拟研究中不可估计的均方误差(MSE)趋势具有相似的捕获最小化γ的能力。与固定γ方法和其他鲁棒方法相比，我们在线性回归问题(AdaReg)中的数据自适应鲁棒化方法在GTEx项目中估计心脏样本组织效应的模拟研究和实际数据应用中都显示出显著的优势。最后，对该方法的局限性和今后的工作进行了讨论。

{"title":"AdaReg: data adaptive robust estimation in linear regression with application in GTEx gene expressions.","authors":"Meng Wang, Lihua Jiang, Michael P Snyder","doi":"10.1515/sagmb-2020-0042","DOIUrl":"https://doi.org/10.1515/sagmb-2020-0042","url":null,"abstract":"The Genotype-Tissue Expression (GTEx) project provides a valuable resource of large-scale gene expressions across multiple tissue types. Under various technical noise and unknown or unmeasured factors, how to robustly estimate the major tissue effect becomes challenging. Moreover, different genes exhibit heterogeneous expressions across different tissue types. Therefore, we need a robust method which adapts to the heterogeneities of gene expressions to improve the estimation for the tissue effect. We followed the approach of the robust estimation based on γ-density-power-weight in the works of Fujisawa, H. and Eguchi, S. (2008). Robust parameter estimation with a small bias against heavy contamination. J. Multivariate Anal. 99: 2053-2081 and Windham, M.P. (1995). Robustifying model fitting. J. Roy. Stat. Soc. B: 599-609, where γ is the exponent of density weight which controls the balance between bias and variance. As far as we know, our work is the first to propose a procedure to tune the parameter γ to balance the bias-variance trade-off under the mixture models. We constructed a robust likelihood criterion based on weighted densities in the mixture model of Gaussian population distribution mixed with unknown outlier distribution, and developed a data-adaptive γ-selection procedure embedded into the robust estimation. We provided a heuristic analysis on the selection criterion and found that our practical selection trend under various γ's in average performance has similar capability to capture minimizer γ as the inestimable mean squared error (MSE) trend from our simulation studies under a series of settings. Our data-adaptive robustifying procedure in the linear regression problem (AdaReg) showed a significant advantage in both simulation studies and real data application in estimating tissue effect of heart samples from the GTEx project, compared to the fixed γ procedure and other robust methods. At the end, the paper discussed some limitations on this method and future work.","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"20 2","pages":"51-71"},"PeriodicalIF":0.9,"publicationDate":"2021-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2020-0042","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39177341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Collocation based training of neural ordinary differential equations. 基于配置的神经常微分方程训练。

IF 0.9 4区数学 Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2021-07-09 DOI: 10.1515/sagmb-2020-0025

Elisabeth Roesch, Christopher Rackauckas, Michael P H Stumpf

The predictive power of machine learning models often exceeds that of mechanistic modeling approaches. However, the interpretability of purely data-driven models, without any mechanistic basis is often complicated, and predictive power by itself can be a poor metric by which we might want to judge different methods. In this work, we focus on the relatively new modeling techniques of neural ordinary differential equations. We discuss how they relate to machine learning and mechanistic models, with the potential to narrow the gulf between these two frameworks: they constitute a class of hybrid model that integrates ideas from data-driven and dynamical systems approaches. Training neural ODEs as representations of dynamical systems data has its own specific demands, and we here propose a collocation scheme as a fast and efficient training strategy. This alleviates the need for costly ODE solvers. We illustrate the advantages that collocation approaches offer, as well as their robustness to qualitative features of a dynamical system, and the quantity and quality of observational data. We focus on systems that exemplify some of the hallmarks of complex dynamical systems encountered in systems biology, and we map out how these methods can be used in the analysis of mathematical models of cellular and physiological processes.

机器学习模型的预测能力往往超过机械建模方法。然而，没有任何机制基础的纯数据驱动模型的可解释性通常是复杂的，并且预测能力本身可能是一个糟糕的度量标准，我们可能希望通过它来判断不同的方法。在这项工作中，我们专注于相对较新的神经常微分方程建模技术。我们将讨论它们如何与机器学习和机械模型相关联，并有可能缩小这两个框架之间的鸿沟:它们构成了一类混合模型，集成了数据驱动和动态系统方法的思想。训练神经ode作为动态系统数据的表示有其特定的要求，本文提出了一种快速有效的训练策略。这减少了对昂贵的ODE求解器的需求。我们说明了搭配方法提供的优势，以及它们对动力系统定性特征的鲁棒性，以及观测数据的数量和质量。我们专注于系统生物学中遇到的复杂动力系统的一些典型特征的系统，并且我们绘制了如何将这些方法用于细胞和生理过程的数学模型分析。

{"title":"Collocation based training of neural ordinary differential equations.","authors":"Elisabeth Roesch, Christopher Rackauckas, Michael P H Stumpf","doi":"10.1515/sagmb-2020-0025","DOIUrl":"https://doi.org/10.1515/sagmb-2020-0025","url":null,"abstract":"The predictive power of machine learning models often exceeds that of mechanistic modeling approaches. However, the interpretability of purely data-driven models, without any mechanistic basis is often complicated, and predictive power by itself can be a poor metric by which we might want to judge different methods. In this work, we focus on the relatively new modeling techniques of neural ordinary differential equations. We discuss how they relate to machine learning and mechanistic models, with the potential to narrow the gulf between these two frameworks: they constitute a class of hybrid model that integrates ideas from data-driven and dynamical systems approaches. Training neural ODEs as representations of dynamical systems data has its own specific demands, and we here propose a collocation scheme as a fast and efficient training strategy. This alleviates the need for costly ODE solvers. We illustrate the advantages that collocation approaches offer, as well as their robustness to qualitative features of a dynamical system, and the quantity and quality of observational data. We focus on systems that exemplify some of the hallmarks of complex dynamical systems encountered in systems biology, and we map out how these methods can be used in the analysis of mathematical models of cellular and physiological processes.","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"20 2","pages":"37-49"},"PeriodicalIF":0.9,"publicationDate":"2021-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2020-0025","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39164853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 17

Fine tuned exploration of evolutionary relationships within the protein universe. 对蛋白质宇宙中进化关系的精细探索。

IF 0.9 4区数学 Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2021-02-17 DOI: 10.1515/sagmb-2019-0039

Danilo Gullotto

In the regime of domain classifications, the protein universe unveils a discrete set of folds connected by hierarchical relationships. Instead, at sub-domain-size resolution and because of physical constraints not necessarily requiring evolution to shape polypeptide chains, networks of protein motifs depict a continuous view that lies beyond the extent of hierarchical classification schemes. A number of studies, however, suggest that universal sub-sequences could be the descendants of peptides emerged in an ancient pre-biotic world. Should this be the case, evolutionary signals retained by structurally conserved motifs, along with hierarchical features of ancient domains, could sew relationships among folds that diverged beyond the point where homology is discernable. In view of the aforementioned, this paper provides a rationale where a network with hierarchical and continuous levels of the protein space, together with sequence profiles that probe the extent of sequence similarity and contacting residues that capture the transition from pre-biotic to domain world, has been used to explore relationships between ancient folds. Statistics of detected signals have been reported. As a result, an example of an emergent sub-network that makes sense from an evolutionary perspective, where conserved signals retrieved from the assessed protein space have been co-opted, has been discussed.

在结构域分类制度中，蛋白质宇宙揭示了一组由层次关系连接的离散折叠。相反，在亚结构域大小的分辨率下，由于物理约束不一定需要进化来形成多肽链，蛋白质基序网络描绘了一个连续的视图，超出了等级分类方案的范围。然而，许多研究表明，普遍的亚序列可能是在一个古老的前生命世界中出现的肽的后代。如果是这样的话，那么由结构保守的基序所保留的进化信号，以及古代域的等级特征，可能会将褶皱之间的关系缝合起来，而这些褶皱之间的关系已经超出了同源性可识别的程度。鉴于上述情况，本文提供了一个理论基础，其中一个具有分层和连续水平的蛋白质空间网络，以及用于探测序列相似性程度的序列概况和捕获从生命前到域世界过渡的接触残基，已被用于探索古代褶皱之间的关系。已上报检测信号的统计信息。因此，本文讨论了一个从进化角度来看有意义的紧急子网络的例子，其中从评估的蛋白质空间检索到的保守信号已被增选。

{"title":"Fine tuned exploration of evolutionary relationships within the protein universe.","authors":"Danilo Gullotto","doi":"10.1515/sagmb-2019-0039","DOIUrl":"https://doi.org/10.1515/sagmb-2019-0039","url":null,"abstract":"In the regime of domain classifications, the protein universe unveils a discrete set of folds connected by hierarchical relationships. Instead, at sub-domain-size resolution and because of physical constraints not necessarily requiring evolution to shape polypeptide chains, networks of protein motifs depict a continuous view that lies beyond the extent of hierarchical classification schemes. A number of studies, however, suggest that universal sub-sequences could be the descendants of peptides emerged in an ancient pre-biotic world. Should this be the case, evolutionary signals retained by structurally conserved motifs, along with hierarchical features of ancient domains, could sew relationships among folds that diverged beyond the point where homology is discernable. In view of the aforementioned, this paper provides a rationale where a network with hierarchical and continuous levels of the protein space, together with sequence profiles that probe the extent of sequence similarity and contacting residues that capture the transition from pre-biotic to domain world, has been used to explore relationships between ancient folds. Statistics of detected signals have been reported. As a result, an example of an emergent sub-network that makes sense from an evolutionary perspective, where conserved signals retrieved from the assessed protein space have been co-opted, has been discussed.","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"20 1","pages":"17-36"},"PeriodicalIF":0.9,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2019-0039","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25375885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

An Empirical Bayes approach for the identification of long-range chromosomal interaction from Hi-C data. 从Hi-C数据中鉴定远距离染色体相互作用的经验贝叶斯方法。

IF 0.9 4区数学 Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2021-01-25 DOI: 10.1515/sagmb-2020-0026

Qi Zhang, Zheng Xu, Yutong Lai

Hi-C experiments have become very popular for studying the 3D genome structure in recent years. Identification of long-range chromosomal interaction, i.e., peak detection, is crucial for Hi-C data analysis. But it remains a challenging task due to the inherent high dimensionality, sparsity and the over-dispersion of the Hi-C count data matrix. We propose EBHiC, an empirical Bayes approach for peak detection from Hi-C data. The proposed framework provides flexible over-dispersion modeling by explicitly including the "true" interaction intensities as latent variables. To implement the proposed peak identification method (via the empirical Bayes test), we estimate the overall distributions of the observed counts semiparametrically using a Smoothed Expectation Maximization algorithm, and the empirical null based on the zero assumption. We conducted extensive simulations to validate and evaluate the performance of our proposed approach and applied it to real datasets. Our results suggest that EBHiC can identify better peaks in terms of accuracy, biological interpretability, and the consistency across biological replicates. The source code is available on Github (https://github.com/QiZhangStat/EBHiC).

近年来，Hi-C实验已成为研究三维基因组结构的热门方法。鉴定远距离染色体相互作用，即峰检测，对Hi-C数据分析至关重要。但由于Hi-C计数数据矩阵固有的高维性、稀疏性和过色散性，这仍然是一项具有挑战性的任务。我们提出了EBHiC，一种从Hi-C数据中检测峰的经验贝叶斯方法。所提出的框架通过明确地包括“真实”相互作用强度作为潜在变量，提供了灵活的过分散建模。为了实现所提出的峰值识别方法(通过经验贝叶斯检验)，我们使用平滑期望最大化算法估计观测计数的半参数总体分布，并基于零假设估计经验零。我们进行了大量的模拟来验证和评估我们提出的方法的性能，并将其应用于实际数据集。我们的研究结果表明，EBHiC在准确性、生物可解释性和跨生物重复的一致性方面可以识别出更好的峰。源代码可在Github (https://github.com/QiZhangStat/EBHiC)上获得。

{"title":"An Empirical Bayes approach for the identification of long-range chromosomal interaction from Hi-C data.","authors":"Qi Zhang, Zheng Xu, Yutong Lai","doi":"10.1515/sagmb-2020-0026","DOIUrl":"https://doi.org/10.1515/sagmb-2020-0026","url":null,"abstract":"Hi-C experiments have become very popular for studying the 3D genome structure in recent years. Identification of long-range chromosomal interaction, i.e., peak detection, is crucial for Hi-C data analysis. But it remains a challenging task due to the inherent high dimensionality, sparsity and the over-dispersion of the Hi-C count data matrix. We propose EBHiC, an empirical Bayes approach for peak detection from Hi-C data. The proposed framework provides flexible over-dispersion modeling by explicitly including the \"true\" interaction intensities as latent variables. To implement the proposed peak identification method (via the empirical Bayes test), we estimate the overall distributions of the observed counts semiparametrically using a Smoothed Expectation Maximization algorithm, and the empirical null based on the zero assumption. We conducted extensive simulations to validate and evaluate the performance of our proposed approach and applied it to real datasets. Our results suggest that EBHiC can identify better peaks in terms of accuracy, biological interpretability, and the consistency across biological replicates. The source code is available on Github (https://github.com/QiZhangStat/EBHiC).","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"20 1","pages":"1-15"},"PeriodicalIF":0.9,"publicationDate":"2021-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2020-0026","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25336730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Combining dependent p-values by gamma distributions. 通过伽马分布合并从属 P 值。

IF 0.9 4区数学 Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2020-11-06 DOI: 10.1515/sagmb-2019-0057

Li-Chu Chien

Combining correlated p-values from multiple hypothesis testing is a most frequently used method for integrating information in genetic and genomic data analysis. However, most existing methods for combining independent p-values from individual component problems into a single unified p-value are unsuitable for the correlational structure among p-values from multiple hypothesis testing. Although some existing p-value combination methods had been modified to overcome the potential limitations, there is no uniformly most powerful method for combining correlated p-values in genetic data analysis. Therefore, providing a p-value combination method that can robustly control type I errors and keep the good power rates is necessary. In this paper, we propose an empirical method based on the gamma distribution (EMGD) for combining dependent p-values from multiple hypothesis testing. The proposed test, EMGD, allows for flexible accommodating the highly correlated p-values from the multiple hypothesis testing into a unified p-value for examining the combined hypothesis that we are interested in. The EMGD retains the robustness character of the empirical Brown's method (EBM) for pooling the dependent p-values from multiple hypothesis testing. Moreover, the EMGD keeps the character of the method based on the gamma distribution that simultaneously retains the advantages of the z-transform test and the gamma-transform test for combining dependent p-values from multiple statistical tests. The two characters lead to the EMGD that can keep the robust power for combining dependent p-values from multiple hypothesis testing. The performance of the proposed method EMGD is illustrated with simulations and real data applications by comparing with the existing methods, such as Kost and McDermott's method, the EBM and the harmonic mean p-value method.

结合多个假设检验的相关 p 值是遗传和基因组数据分析中最常用的信息整合方法。然而，大多数现有的将来自单个成分问题的独立 p 值合并为一个统一 p 值的方法都不适合多重假设检验中 p 值之间的相关结构。虽然对现有的一些 p 值组合方法进行了修改，以克服潜在的局限性，但在遗传数据分析中，还没有统一的最强大的相关 p 值组合方法。因此，有必要提供一种既能稳健地控制 I 型误差又能保持良好幂率的 p 值组合方法。在本文中，我们提出了一种基于伽马分布（EMGD）的经验方法，用于组合多重假设检验的相关 p 值。所提出的 EMGD 检验可以灵活地将多重假设检验中高度相关的 p 值合并成一个统一的 p 值，用于检验我们感兴趣的组合假设。EMGD 保留了经验布朗法（EBM）的稳健性，可用于汇集多重假设检验的依存 p 值。此外，EMGD 还保留了基于伽马分布的方法的特点，即同时保留了 z 变换检验和伽马变换检验的优点，用于合并来自多个统计检验的从属 p 值。这两种特性使得 EMGD 在组合多重假设检验的因变量 p 值时能保持稳健的功率。通过与 Kost 和 McDermott 方法、EBM 和调和均值 p 值方法等现有方法的比较，模拟和实际数据应用说明了所提出的 EMGD 方法的性能。

{"title":"Combining dependent p-values by gamma distributions.","authors":"Li-Chu Chien","doi":"10.1515/sagmb-2019-0057","DOIUrl":"10.1515/sagmb-2019-0057","url":null,"abstract":"Combining correlated p-values from multiple hypothesis testing is a most frequently used method for integrating information in genetic and genomic data analysis. However, most existing methods for combining independent p-values from individual component problems into a single unified p-value are unsuitable for the correlational structure among p-values from multiple hypothesis testing. Although some existing p-value combination methods had been modified to overcome the potential limitations, there is no uniformly most powerful method for combining correlated p-values in genetic data analysis. Therefore, providing a p-value combination method that can robustly control type I errors and keep the good power rates is necessary. In this paper, we propose an empirical method based on the gamma distribution (EMGD) for combining dependent p-values from multiple hypothesis testing. The proposed test, EMGD, allows for flexible accommodating the highly correlated p-values from the multiple hypothesis testing into a unified p-value for examining the combined hypothesis that we are interested in. The EMGD retains the robustness character of the empirical Brown's method (EBM) for pooling the dependent p-values from multiple hypothesis testing. Moreover, the EMGD keeps the character of the method based on the gamma distribution that simultaneously retains the advantages of the z-transform test and the gamma-transform test for combining dependent p-values from multiple statistical tests. The two characters lead to the EMGD that can keep the robust power for combining dependent p-values from multiple hypothesis testing. The performance of the proposed method EMGD is illustrated with simulations and real data applications by comparing with the existing methods, such as Kost and McDermott's method, the EBM and the harmonic mean p-value method.","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":" ","pages":""},"PeriodicalIF":0.9,"publicationDate":"2020-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"38572300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Bayesian reconstruction of transmission trees from genetic sequences and uncertain infection times. 根据基因序列和不确定的感染时间贝叶斯法重建传播树。

IF 0.9 4区数学 Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2020-10-21 DOI: 10.1515/sagmb-2019-0026

Hesam Montazeri, Susan Little, Mozhgan Mozaffarilegha, Niko Beerenwinkel, Victor DeGruttola

Genetic sequence data of pathogens are increasingly used to investigate transmission dynamics in both endemic diseases and disease outbreaks. Such research can aid in the development of appropriate interventions and in the design of studies to evaluate them. Several computational methods have been proposed to infer transmission chains from sequence data; however, existing methods do not generally reliably reconstruct transmission trees because genetic sequence data or inferred phylogenetic trees from such data contain insufficient information for accurate estimation of transmission chains. Here, we show by simulation studies that incorporating infection times, even when they are uncertain, can greatly improve the accuracy of reconstruction of transmission trees. To achieve this improvement, we propose a Bayesian inference methods using Markov chain Monte Carlo that directly draws samples from the space of transmission trees under the assumption of complete sampling of the outbreak. The likelihood of each transmission tree is computed by a phylogenetic model by treating its internal nodes as transmission events. By a simulation study, we demonstrate that accuracy of the reconstructed transmission trees depends mainly on the amount of information available on times of infection; we show superiority of the proposed method to two alternative approaches when infection times are known up to specified degrees of certainty. In addition, we illustrate the use of a multiple imputation framework to study features of epidemic dynamics, such as the relationship between characteristics of nodes and average number of outbound edges or inbound edges, signifying possible transmission events from and to nodes. We apply the proposed method to a transmission cluster in San Diego and to a dataset from the 2014 Sierra Leone Ebola virus outbreak and investigate the impact of biological, behavioral, and demographic factors.

病原体的基因序列数据越来越多地被用于研究地方病和疾病爆发的传播动态。此类研究有助于制定适当的干预措施，也有助于设计对其进行评估的研究。已经提出了几种计算方法来从序列数据中推断传播链；然而，现有的方法一般不能可靠地重建传播树，因为基因序列数据或从此类数据推断出的系统发生树包含的信息不足以准确估计传播链。在这里，我们通过模拟研究表明，即使感染时间不确定，纳入感染时间也能大大提高传播树重建的准确性。为了实现这一改进，我们提出了一种使用马尔科夫链蒙特卡洛的贝叶斯推理方法，在对疫情进行完全采样的假设下，直接从传播树空间中抽取样本。将每个传播树的内部节点视为传播事件，通过系统发生学模型计算其可能性。通过模拟研究，我们证明了重建传播树的准确性主要取决于可获得的感染时间信息量；当感染时间已知到指定的确定程度时，我们证明了所提出的方法优于两种替代方法。此外，我们还说明了如何使用多重归因框架来研究流行病动态特征，例如节点特征与出境边或入境边平均数量之间的关系，这意味着节点之间可能发生的传播事件。我们将提出的方法应用于圣地亚哥的一个传播集群和 2014 年塞拉利昂埃博拉病毒爆发的数据集，并研究了生物、行为和人口因素的影响。

{"title":"Bayesian reconstruction of transmission trees from genetic sequences and uncertain infection times.","authors":"Hesam Montazeri, Susan Little, Mozhgan Mozaffarilegha, Niko Beerenwinkel, Victor DeGruttola","doi":"10.1515/sagmb-2019-0026","DOIUrl":"10.1515/sagmb-2019-0026","url":null,"abstract":"Genetic sequence data of pathogens are increasingly used to investigate transmission dynamics in both endemic diseases and disease outbreaks. Such research can aid in the development of appropriate interventions and in the design of studies to evaluate them. Several computational methods have been proposed to infer transmission chains from sequence data; however, existing methods do not generally reliably reconstruct transmission trees because genetic sequence data or inferred phylogenetic trees from such data contain insufficient information for accurate estimation of transmission chains. Here, we show by simulation studies that incorporating infection times, even when they are uncertain, can greatly improve the accuracy of reconstruction of transmission trees. To achieve this improvement, we propose a Bayesian inference methods using Markov chain Monte Carlo that directly draws samples from the space of transmission trees under the assumption of complete sampling of the outbreak. The likelihood of each transmission tree is computed by a phylogenetic model by treating its internal nodes as transmission events. By a simulation study, we demonstrate that accuracy of the reconstructed transmission trees depends mainly on the amount of information available on times of infection; we show superiority of the proposed method to two alternative approaches when infection times are known up to specified degrees of certainty. In addition, we illustrate the use of a multiple imputation framework to study features of epidemic dynamics, such as the relationship between characteristics of nodes and average number of outbound edges or inbound edges, signifying possible transmission events from and to nodes. We apply the proposed method to a transmission cluster in San Diego and to a dataset from the 2014 Sierra Leone Ebola virus outbreak and investigate the impact of biological, behavioral, and demographic factors.","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":" ","pages":""},"PeriodicalIF":0.9,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8212962/pdf/nihms-1709644.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"38519010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A weighted empirical Bayes risk prediction model using multiple traits. 多特征加权经验贝叶斯风险预测模型。

IF 0.9 4区数学 Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2020-09-04 DOI: 10.1515/sagmb-2019-0056

Gengxin Li, Lin Hou, Xiaoyu Liu, Cen Wu

With rapid advances in high-throughput sequencing technology, millions of single-nucleotide variants (SNVs) can be simultaneously genotyped in a sequencing study. These SNVs residing in functional genomic regions such as exons may play a crucial role in biological process of the body. In particular, non-synonymous SNVs are closely related to the protein sequence and its function, which are important in understanding the biological mechanism of sequence evolution. Although statistically challenging, models incorporating such SNV annotation information can improve the estimation of genetic effects, and multiple responses may further strengthen the signals of these variants on the assessment of disease risk. In this work, we develop a new weighted empirical Bayes method to integrate SNV annotation information in a multi-trait design. The performance of this proposed model is evaluated in simulation as well as a real sequencing data; thus, the proposed method shows improved prediction accuracy compared to other approaches.

随着高通量测序技术的快速发展，在测序研究中可以同时对数百万个单核苷酸变异(snv)进行基因分型。这些存在于功能基因组区域(如外显子)的snv可能在机体的生物学过程中起着至关重要的作用。其中，非同义snv与蛋白质序列及其功能密切相关，对理解序列进化的生物学机制具有重要意义。尽管在统计上具有挑战性，但纳入此类SNV注释信息的模型可以改善对遗传效应的估计，并且多重反应可能进一步加强这些变异对疾病风险评估的信号。在这项工作中，我们开发了一种新的加权经验贝叶斯方法来整合多特征设计中的SNV注释信息。在仿真和实际测序数据中对该模型的性能进行了评价;因此，与其他方法相比，该方法具有更高的预测精度。

引用次数: 0