Statistical Applications in Genetics and Molecular Biology: Latest Articles

LCox: a tool for selecting genes related to survival outcomes using longitudinal gene expression data.
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2019-02-13, DOI: 10.1515/sagmb-2017-0060
Jiehuan Sun, Jose D Herazo-Maya, Jane-Ling Wang, Naftali Kaminski, Hongyu Zhao

Longitudinal genomics data and survival outcomes are common in biomedical studies, where the genomics data are often high-dimensional. It is of great interest to select informative longitudinal biomarkers (e.g. genes) related to the survival outcome. In this paper, we develop a computationally efficient tool, LCox, for selecting informative biomarkers related to the survival outcome using longitudinal genomics data. LCox has high power to detect different forms of dependence between the longitudinal biomarkers and the survival outcome. Through extensive simulation studies, we show that LCox has improved performance compared to existing methods. In addition, by applying LCox to a dataset of patients with idiopathic pulmonary fibrosis, we are able to identify biologically meaningful genes, whereas all other methods fail to make any discovery. An R package implementing LCox is freely available at https://CRAN.R-project.org/package=LCox.

Citations: 0
Meta-analytic framework for modeling genetic coexpression dynamics.
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2019-02-09, DOI: 10.1515/sagmb-2017-0052
Tyler G Kinzy, Timothy K Starr, George C Tseng, Yen-Yi Ho

Methods for exploring genetic interactions have been developed in an attempt to move beyond single gene analyses. Because biological molecules frequently participate in different processes under various cellular conditions, investigating the changes in gene coexpression patterns under various biological conditions could reveal important regulatory mechanisms. One of the methods for capturing gene coexpression dynamics, named liquid association (LA), quantifies the relationship where the coexpression between two genes is modulated by a third "coordinator" gene. This LA measure offers a natural framework for studying gene coexpression changes and has been applied increasingly to study regulatory networks among genes. With a wealth of publicly available gene expression data, there is a need to develop a meta-analytic framework for LA analysis. In this paper, we incorporated mixed effects when modeling correlation to account for between-studies heterogeneity. For statistical inference about LA, we developed a Markov chain Monte Carlo (MCMC) estimation procedure through a Bayesian hierarchical framework. We evaluated the proposed methods in a set of simulations and illustrated their use in two collections of experimental data sets. The first data set combined 10 pancreatic ductal adenocarcinoma gene expression studies to determine the role of possible coordinator gene USP9X in the Hippo pathway. The second experimental data set consisted of 907 gene expression microarray Escherichia coli experiments from multiple studies publicly available through the Many Microbe Microarray Database website (http://m3d.bu.edu/) and examined genes that coexpress with serA in the presence of coordinator gene Lrp.
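The liquid association measure mentioned above can be illustrated with a small sketch. It uses the classical single-study three-product-moment estimator (the mean of X·Y·Z after standardizing X and Y and normal-score transforming Z), not the mixed-effects Bayesian meta-analytic model of this paper. The function names and the simulated data are hypothetical and chosen only for illustration.

```python
import random
from statistics import NormalDist, mean, pstdev


def standardize(values):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]


def normal_scores(values):
    """Rank-based normal-score transform so the result is roughly N(0, 1).

    Ties are broken by position, which is adequate for continuous data.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    nd = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = nd.inv_cdf(rank / (n + 1))
    return scores


def liquid_association(x, y, z):
    """Three-product-moment estimate of LA(X, Y | Z) for one study."""
    xs, ys, zs = standardize(x), standardize(y), normal_scores(z)
    return mean(a * b * c for a, b, c in zip(xs, ys, zs))


if __name__ == "__main__":
    # Simulated example: the X-Y coexpression grows with the coordinator Z.
    rng = random.Random(0)
    z = [rng.gauss(0, 1) for _ in range(2000)]
    x = [rng.gauss(0, 1) for _ in range(2000)]
    y = [zi * xi + 0.5 * rng.gauss(0, 1) for zi, xi in zip(z, x)]
    print("estimated LA:", liquid_association(x, y, z))
```

A clearly positive value suggests that the X-Y correlation increases with Z, and a value near zero suggests no modulation. Combining such estimates across studies while accounting for heterogeneity is the problem the paper addresses.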

Citations: 3
Sliced inverse regression for integrative multi-omics data analysis.
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2019-01-26, DOI: 10.1515/sagmb-2018-0028
Yashita Jain, Shanshan Ding, Jing Qiu

Advances in next-generation sequencing, transcriptomics, proteomics and other high-throughput technologies have enabled simultaneous measurement of multiple types of genomic data for cancer samples. Analyzed together, these data may reveal biological insights that a single data type cannot. This study proposes a novel application of a supervised dimension reduction method, sliced inverse regression, to multi-omics data analysis to improve prediction over single data type analysis. The study further proposes an integrative sliced inverse regression method (integrative SIR) for simultaneous analysis of multiple omics data types of cancer samples, including miRNA, mRNA and proteomics, to achieve integrative dimension reduction and to further improve prediction performance. Numerical results show that integrative analysis of multi-omics data is beneficial compared to single data source analysis and, more importantly, that supervised dimension reduction methods have advantages over unsupervised dimension reduction methods in integrative data analysis in terms of classification and prediction.

Citations: 1
A powerful test for ordinal trait genetic association analysis.
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2019-01-26, DOI: 10.1515/sagmb-2017-0066
Yuan Xue, Jinjuan Wang, Juan Ding, Sanguo Zhang, Qizhai Li

Response-selective sampling designs are commonly adopted in genetic epidemiologic studies because, compared with prospective designs, they can substantially reduce time cost and increase the power to identify deleterious genetic variants that predispose to complex human diseases. The proportional odds model (POM) can be used to fit data obtained under this design. Unlike in the logistic regression model, the genetic effect estimated from the POM by treating the data as if they had been enrolled prospectively is inconsistent, so the power of the resulting Wald test is unsatisfactory. The modified POM is suitable for fitting this type of data; however, the corresponding Wald test is not optimal when the genetic effect is small. Here, we propose a new association test to handle this issue. Simulation studies show that the proposed test controls the type I error rate correctly and is more powerful than two existing methods. Finally, we applied the three tests to the Anticyclic Citrullinated Protein Antibody data from Genetic Workshop 16.
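For orientation, the standard textbook form of the proportional odds model and of a Wald test on the genetic effect can be written as below. The symbols (an ordinal trait $Y$ with $J$ levels, a genotype score $G$, intercepts $\alpha_j$, effect $\beta$) and the sign convention are assumptions chosen here for illustration. The paper's modified POM and its proposed test are not reproduced.

```latex
\begin{align}
  \operatorname{logit} P(Y \le j \mid G = g) &= \alpha_j + \beta g,
    \qquad j = 1, \dots, J-1, \quad \alpha_1 \le \dots \le \alpha_{J-1}, \\
  W &= \frac{\hat\beta^{2}}{\widehat{\operatorname{Var}}(\hat\beta)}
    \;\overset{\text{approx.}}{\sim}\; \chi^2_1
    \quad \text{under } H_0\colon \beta = 0 .
\end{align}
```

Under this model the cumulative odds ratio $e^{\beta}$ is the same for every cut point $j$. The chi-square reference distribution for $W$ relies on $\hat\beta$ being a consistent estimator, which is the property the abstract says is lost when response-selective data are analyzed as if they were prospective.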

Citations: 2
Sample size calculations for the differential expression analysis of RNA-seq data using a negative binomial regression model.
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2019-01-22, DOI: 10.1515/sagmb-2018-0021
Xiaohong Li, Dongfeng Wu, Nigel G F Cooper, Shesh N Rai

High-throughput RNA sequencing (RNA-seq) technology is increasingly used in disease-related biomarker studies. Because read counts are over-dispersed, the negative binomial distribution has become a popular choice for modeling read counts of genes in RNA-seq data. In this study, we propose two explicit sample size calculation methods for RNA-seq data using a negative binomial regression model. To derive these new sample size formulas, the common dispersion parameter and the size factor, entered as an offset via a natural logarithm link function, are incorporated. A two-sided Wald test statistic derived from the coefficient parameter is used for testing a single gene at a nominal significance level of 0.05 and multiple genes at a false discovery rate of 0.05. The variance for the Wald test is computed from the variance-covariance matrix, with the parameters estimated by maximum likelihood under the unrestricted and constrained scenarios. The performance of our new formulas is evaluated via simulation studies, together with a side-by-side comparison with three existing methods based on a Wald test, a likelihood ratio test or an exact test. Since the other methods are much more computationally intensive, we recommend our M1 method for quick and direct estimation of sample sizes in an experimental design. Finally, we illustrate sample size estimation using an existing breast cancer RNA-seq data set.
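To make the idea of an explicit Wald-based sample size formula concrete, here is a minimal sketch for a single gene in a two-group comparison. It uses a generic delta-method approximation for the variance of the log fold change under a negative binomial model with variance mu + dispersion * mu^2, equal group sizes and size factors of 1. It is not the paper's M1 or M2 formula, and the function name and parameter values are hypothetical.

```python
import math
from statistics import NormalDist


def nb_sample_size_per_group(mu_control, fold_change, dispersion,
                             alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided Wald test.

    Assumes a negative binomial model with Var(Y) = mu + dispersion * mu**2,
    equal allocation, and all size factors equal to 1 (an assumption of this
    sketch, not of the paper).
    """
    mu_treated = mu_control * fold_change
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Delta method: n * Var(log fold-change estimate) for n samples per group.
    unit_variance = 1 / mu_control + 1 / mu_treated + 2 * dispersion
    effect = math.log(fold_change)
    return math.ceil((z_alpha + z_beta) ** 2 * unit_variance / effect ** 2)


if __name__ == "__main__":
    # Mean count 10 in controls, two-fold change, dispersion 0.1.
    print(nb_sample_size_per_group(10, 2.0, 0.1))
```

For multiple genes, the same function could be called with a smaller per-gene alpha chosen to target a false discovery rate. How that alpha is derived is part of the paper's method and is not shown here.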

Citations: 7
MLML2R: an R package for maximum likelihood estimation of DNA methylation and hydroxymethylation proportions.
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2019-01-17, DOI: 10.1515/sagmb-2018-0031
Samara F Kiihl, Maria Jose Martinez-Garrido, Arce Domingo-Relloso, Jose Bermudez, Maria Tellez-Plaza

Accurately measuring epigenetic marks such as 5-methylcytosine (5-mC) and 5-hydroxymethylcytosine (5-hmC) at the single-nucleotide level requires combining data from DNA processing methods including traditional (BS), oxidative (oxBS) or Tet-Assisted (TAB) bisulfite conversion. We introduce the R package MLML2R, which provides maximum likelihood estimates (MLE) of 5-mC and 5-hmC proportions. While all other available R packages provide 5-mC and 5-hmC MLEs only for the oxBS+BS combination, MLML2R also provides MLEs for combinations involving TAB. For combinations of any two of the methods, we derived the pool-adjacent-violators algorithm (PAVA) exact constrained MLE in analytical form. For the combination of all three methods, we implemented both the iterative method of Qu et al. [Qu, J., M. Zhou, Q. Song, E. E. Hong and A. D. Smith (2013): "Mlml: consistent simultaneous estimates of dna methylation and hydroxymethylation," Bioinformatics, 29, 2645-2646.] and a novel non-iterative approximation using Lagrange multipliers. The newly proposed non-iterative solutions greatly decrease computational time, a common bottleneck when processing high-throughput data. The MLML2R package is flexible, as it takes as input both preprocessed intensities from Infinium Methylation arrays and counts from Next Generation Sequencing technologies. The MLML2R package is freely available at https://CRAN.R-project.org/package=MLML2R.
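The two-method case can be sketched for the BS+oxBS combination at a single site. The sketch assumes that the methylated fraction under BS estimates the sum of 5-mC and 5-hmC while the fraction under oxBS estimates 5-mC alone, and that the two count sets are independent binomials. When the naive 5-hmC estimate would be negative, the two samples are pooled, which is the pool-adjacent-violators step for two ordered proportions. The code is written in Python for illustration only: the function name and argument names are hypothetical and are not the MLML2R API.

```python
def mle_bs_oxbs(meth_bs, total_bs, meth_oxbs, total_oxbs):
    """Constrained MLE of 5-mC and 5-hmC proportions from BS and oxBS counts.

    meth_* are methylated (unconverted) counts and total_* are coverage.
    Constraint enforced: 5-hmC proportion >= 0, i.e. p_BS >= p_oxBS.
    """
    p_bs = meth_bs / total_bs        # estimates 5-mC + 5-hmC
    p_ox = meth_oxbs / total_oxbs    # estimates 5-mC only
    if p_bs >= p_ox:
        # Unconstrained estimates already satisfy the constraint.
        p_mc, p_hmc = p_ox, p_bs - p_ox
    else:
        # Violation: pool both samples into one common proportion.
        p_mc = (meth_bs + meth_oxbs) / (total_bs + total_oxbs)
        p_hmc = 0.0
    return {"mC": p_mc, "hmC": p_hmc, "C": 1.0 - p_mc - p_hmc}


if __name__ == "__main__":
    print(mle_bs_oxbs(8, 10, 5, 10))   # no violation
    print(mle_bs_oxbs(4, 10, 6, 10))   # violation, proportions are pooled
```

Because the estimate is available in closed form, it can be applied site by site without iteration, which is in the spirit of the computational savings the abstract describes.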

Citations: 10
An Empirical Bayes approach for the identification of long-range chromosomal interaction from Hi-C data
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2018-12-17, DOI: 10.1101/497776
Qi Zhang, Zheng Xu, Yutong Lai
Hi-C experiments have become very popular for studying the 3D genome structure in recent years. Identification of long-range chromosomal interactions, i.e., peak detection, is crucial for Hi-C data analysis, but it remains a challenging task due to the inherent high dimensionality, sparsity and over-dispersion of the Hi-C count data matrix. We propose EBHiC, an empirical Bayes approach for peak detection from Hi-C data. The proposed framework provides flexible over-dispersion modeling by explicitly including the "true" interaction intensities as latent variables. To implement the proposed peak identification method (via the empirical Bayes test), we estimate the overall distributions of the observed counts semiparametrically using a Smoothed Expectation Maximization algorithm, and the empirical null based on the zero assumption. We conducted extensive simulations to validate and evaluate the performance of our proposed approach and applied it to real datasets. Our results suggest that EBHiC can identify better peaks in terms of accuracy, biological interpretability, and consistency across biological replicates. The source code is available on GitHub (https://github.com/QiZhangStat/EBHiC).
Citations: 0
False discovery control for penalized variable selections with high-dimensional covariates.
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2018-12-15, DOI: 10.1515/sagmb-2018-0038
Kevin He, Xiang Zhou, Hui Jiang, Xiaoquan Wen, Yi Li

Modern bio-technologies have produced a vast amount of high-throughput data with the number of predictors much exceeding the sample size. Penalized variable selection has emerged as a powerful and efficient dimension reduction tool. However, control of false discoveries (i.e. inclusion of irrelevant variables) for penalized high-dimensional variable selection presents serious challenges. To effectively control the fraction of false discoveries for penalized variable selections, we propose a false discovery controlling procedure. The proposed method is general and flexible, and can work with a broad class of variable selection algorithms, not only for linear regressions, but also for generalized linear models and survival analysis.

Citations: 0
A practical approach to adjusting for population stratification in genome-wide association studies: principal components and propensity scores (PCAPS).
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2018-12-04, DOI: 10.1515/sagmb-2017-0054
Huaqing Zhao, Nandita Mitra, Peter A Kanetsky, Katherine L Nathanson, Timothy R Rebbeck

Genome-wide association studies (GWAS) are susceptible to bias due to population stratification (PS). The most widely used method to correct bias due to PS is principal components (PCs) analysis (PCA), but there is no objective method to guide which PCs to include as covariates. Often, the ten PCs with the highest eigenvalues are included to adjust for PS. This selection is arbitrary, and patterns of local linkage disequilibrium may affect PCA corrections. To address these limitations, we estimate genomic propensity scores based on all statistically significant PCs selected by the Tracy-Widom (TW) statistic. We compare a principal components and propensity scores (PCAPS) approach to PCA and EMMAX using simulated GWAS data under no, moderate, and severe PS. PCAPS reduced spurious genetic associations regardless of the degree of PS, resulting in odds ratio (OR) estimates closer to the true OR. We illustrate our PCAPS method using GWAS data from a study of testicular germ cell tumors. PCAPS provided a more conservative adjustment than PCA. Advantages of the PCAPS approach include reduction of bias compared to PCA, consistent selection of propensity scores to adjust for PS, the potential ability to handle outliers, and ease of implementation using existing software packages.

Citations: 0
A novel method to accurately calculate statistical significance of local similarity analysis for high-throughput time series.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2018-11-17 DOI: 10.1515/sagmb-2018-0019
Fang Zhang, Ang Shan, Yihui Luan

In recent years, a large number of time series microbial community data has been produced in molecular biological studies, especially in metagenomics. Among the statistical methods for time series, local similarity analysis is used in a wide range of environments to capture potential local and time-shifted associations that cannot be distinguished by traditional correlation analysis. Initially, the permutation test is popularly applied to obtain the statistical significance of local similarity analysis. More recently, a theoretical method has also been developed to achieve this aim. However, all these methods require the assumption that the time series are independent and identically distributed. In this paper, we propose a new approach based on moving block bootstrap to approximate the statistical significance of local similarity scores for dependent time series. Simulations show that our method can control the type I error rate reasonably, while theoretical approximation and the permutation test perform less well. Finally, our method is applied to human and marine microbial community datasets, indicating that it can identify potential relationship among operational taxonomic units (OTUs) and significantly decrease the rate of false positives.
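
The resampling step named in the abstract above, the moving block bootstrap, can be sketched as follows. This is only an illustration of the generic technique under stated assumptions, not the authors' implementation: the function name `moving_block_bootstrap`, the block length of 3, and the toy series are all hypothetical, and the abstract does not say how block length is chosen or how the resamples feed into the local similarity score.

```python
import math
import random


def moving_block_bootstrap(series, block_len, rng):
    """Return one resample of `series` built from overlapping blocks.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    n = len(series)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(series)")
    n_blocks = math.ceil(n / block_len)
    last_start = n - block_len  # last index at which a full block still fits
    resample = []
    for _ in range(n_blocks):
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        resample.extend(series[start:start + block_len])
    return resample[:n]  # trim the overshoot from the final block


rng = random.Random(0)  # fixed seed so repeated runs give the same resample
x = list(range(10))     # toy series; real input would be OTU abundances over time
resampled = moving_block_bootstrap(x, block_len=3, rng=rng)
```

Because each block is copied as a contiguous run, short-range dependence inside a block is preserved, which is the property an element-wise permutation destroys.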

Citations: 2