
Statistical Applications in Genetics and Molecular Biology: Latest Articles

netprioR: a probabilistic model for integrative hit prioritisation of genetic screens.
IF 0.9 | Zone 4 (Mathematics) | Q3 Mathematics | Pub Date: 2019-03-06 | DOI: 10.1515/sagmb-2018-0033
Fabian Schmich, Jack Kuipers, Gunter Merdes, Niko Beerenwinkel

In the post-genomic era of big data in biology, computational approaches to integrate multiple heterogeneous data sets are becoming increasingly important. Despite the availability of large amounts of omics data, the prioritisation of genes relevant for a specific functional pathway based on genetic screening experiments remains a challenging task. Here, we introduce netprioR, a probabilistic generative model for semi-supervised integrative prioritisation of hit genes. The model integrates multiple network data sets representing gene-gene similarities and prior knowledge about gene functions from the literature with gene-based covariates, such as phenotypes measured in genetic perturbation screens, for example, by RNA interference or CRISPR/Cas9. We evaluate netprioR on simulated data and show that the model outperforms current state-of-the-art methods in many scenarios and is on par otherwise. In an application to real biological data, we integrate 22 network data sets, 1784 prior knowledge class labels and 3840 RNA interference phenotypes in order to prioritise novel regulators of Notch signalling in Drosophila melanogaster. The biological relevance of our predictions is evaluated using in silico and in vivo experiments. An efficient implementation of netprioR is available as an R package at http://bioconductor.org/packages/netprioR.

Citations: 1

Discrete Wavelet Packet Transform Based Discriminant Analysis for Whole Genome Sequences.
IF 0.9 | Zone 4 (Mathematics) | Q3 Mathematics | Pub Date: 2019-02-15 | DOI: 10.1515/sagmb-2018-0045
Hsin-Hsiung Huang, Senthil Balaji Girimurugan
In recent years, alignment-free methods have been widely applied in comparing genome sequences, as these methods compute efficiently and provide desirable phylogenetic analysis results. These methods have been successfully combined with hierarchical clustering methods for finding phylogenetic trees. However, it may not be suitable to apply these alignment-free methods directly to existing statistical classification methods, because an appropriate statistical classification theory for integrating with the alignment-free representation methods is still lacking. In this article, we propose a discriminant analysis method which uses the discrete wavelet packet transform to classify whole genome sequences. The proposed alignment-free representation statistics of features follow a joint normal distribution asymptotically. The data analysis results indicate that the proposed method provides satisfactory classification results in real time.
Citations: 1

LCox: a tool for selecting genes related to survival outcomes using longitudinal gene expression data.
IF 0.9 | Zone 4 (Mathematics) | Q3 Mathematics | Pub Date: 2019-02-13 | DOI: 10.1515/sagmb-2017-0060
Jiehuan Sun, Jose D Herazo-Maya, Jane-Ling Wang, Naftali Kaminski, Hongyu Zhao

Longitudinal genomics data and survival outcomes are common in biomedical studies, where the genomics data are often of high dimension. It is of great interest to select informative longitudinal biomarkers (e.g. genes) related to the survival outcome. In this paper, we develop a computationally efficient tool, LCox, for selecting informative biomarkers related to the survival outcome using the longitudinal genomics data. LCox is powerful in detecting different forms of dependence between the longitudinal biomarkers and the survival outcome. We show through extensive simulation studies that LCox has improved performance compared to existing methods. In addition, by applying LCox to a dataset of patients with idiopathic pulmonary fibrosis, we are able to identify biologically meaningful genes while all other methods fail to make any discovery. An R package to perform LCox is freely available at https://CRAN.R-project.org/package=LCox.

Citations: 0

Meta-analytic framework for modeling genetic coexpression dynamics.
IF 0.9 | Zone 4 (Mathematics) | Q3 Mathematics | Pub Date: 2019-02-09 | DOI: 10.1515/sagmb-2017-0052
Tyler G Kinzy, Timothy K Starr, George C Tseng, Yen-Yi Ho

Methods for exploring genetic interactions have been developed in an attempt to move beyond single gene analyses. Because biological molecules frequently participate in different processes under various cellular conditions, investigating the changes in gene coexpression patterns under various biological conditions could reveal important regulatory mechanisms. One of the methods for capturing gene coexpression dynamics, named liquid association (LA), quantifies the relationship where the coexpression between two genes is modulated by a third "coordinator" gene. This LA measure offers a natural framework for studying gene coexpression changes and has been applied increasingly to study regulatory networks among genes. With a wealth of publicly available gene expression data, there is a need to develop a meta-analytic framework for LA analysis. In this paper, we incorporated mixed effects when modeling correlation to account for between-studies heterogeneity. For statistical inference about LA, we developed a Markov chain Monte Carlo (MCMC) estimation procedure through a Bayesian hierarchical framework. We evaluated the proposed methods in a set of simulations and illustrated their use in two collections of experimental data sets. The first data set combined 10 pancreatic ductal adenocarcinoma gene expression studies to determine the role of possible coordinator gene USP9X in the Hippo pathway. The second experimental data set consisted of 907 gene expression microarray Escherichia coli experiments from multiple studies publicly available through the Many Microbe Microarray Database website (http://m3d.bu.edu/) and examined genes that coexpress with serA in the presence of coordinator gene Lrp.
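
As a rough illustration of the liquid association idea described in this abstract, the sketch below computes a single-study, moment-based LA score: X and Y are standardized, the coordinator Z is normal-score transformed, and the score is the mean of the product X·Y·Z. This is the classical estimator, not the mixed-effects Bayesian meta-analytic procedure of the paper, and all function names and the simulated data are assumptions made for the example.

```python
import random
from statistics import NormalDist, mean, pstdev

def standardize(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def normal_scores(values):
    """Replace each value by the standard normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    nd = NormalDist()
    for rank, i in enumerate(order, start=1):
        scores[i] = nd.inv_cdf(rank / (n + 1))
    return scores

def liquid_association(x, y, z):
    """Moment-based LA score: mean of X*Y*Z after the transformations above."""
    xs, ys, zs = standardize(x), standardize(y), normal_scores(z)
    return mean(a * b * c for a, b, c in zip(xs, ys, zs))

# Simulated example (hypothetical data): the X-Y correlation grows with Z.
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(2000)]
x = [rng.gauss(0, 1) for _ in range(2000)]
y_modulated = [zi * xi + 0.1 * rng.gauss(0, 1) for zi, xi in zip(z, x)]
y_independent = [rng.gauss(0, 1) for _ in range(2000)]

la_modulated = liquid_association(x, y_modulated, z)
la_independent = liquid_association(x, y_independent, z)
print(la_modulated, la_independent)
```

A clearly positive score for the modulated pair and a score near zero for the independent pair is the expected pattern; pooling such scores across studies with heterogeneity is the problem the paper addresses.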

Citations: 0

Sliced inverse regression for integrative multi-omics data analysis.
IF 0.9 | Zone 4 (Mathematics) | Q3 Mathematics | Pub Date: 2019-01-26 | DOI: 10.1515/sagmb-2018-0028
Yashita Jain, Shanshan Ding, Jing Qiu

Advances in next-generation sequencing, transcriptomics, proteomics and other high-throughput technologies have enabled simultaneous measurement of multiple types of genomic data for cancer samples. Taken together, these data may reveal biological insights that analysis of a single type of genomic data cannot. This study proposes a novel use of a supervised dimension reduction method, sliced inverse regression, for multi-omics data analysis to improve prediction over single data type analysis. The study further proposes an integrative sliced inverse regression method (integrative SIR) for simultaneous analysis of multiple omics data types of cancer samples, including miRNA, mRNA and proteomics, to achieve integrative dimension reduction and to further improve prediction performance. Numerical results show that integrative analysis of multi-omics data is beneficial compared to single data source analysis and, more importantly, that supervised dimension reduction methods have advantages over unsupervised dimension reduction methods in integrative data analysis in terms of classification and prediction.

Citations: 1

A powerful test for ordinal trait genetic association analysis.
IF 0.9 | Zone 4 (Mathematics) | Q3 Mathematics | Pub Date: 2019-01-26 | DOI: 10.1515/sagmb-2017-0066
Yuan Xue, Jinjuan Wang, Juan Ding, Sanguo Zhang, Qizhai Li

A response-selective sampling design is commonly adopted in genetic epidemiologic studies because, compared with a prospective design, it can substantially reduce time cost and increase the power to identify deleterious genetic variants that predispose to complex human diseases. The proportional odds model (POM) can be used to fit data obtained under this design. Unlike in the logistic regression model, the genetic effect estimated from the POM by treating the data as if they had been enrolled prospectively is inconsistent, so the power of the resulting Wald test is unsatisfactory. The modified POM is suitable for this type of data; however, the corresponding Wald test is not optimal when the genetic effect is small. Here, we propose a new association test to handle this issue. Simulation studies show that the proposed test controls the type I error rate correctly and is more powerful than two existing methods. Finally, we apply the three tests to anti-cyclic citrullinated protein antibody data from Genetic Analysis Workshop 16.

Citations: 2

Sample size calculations for the differential expression analysis of RNA-seq data using a negative binomial regression model.
IF 0.9 | Zone 4 (Mathematics) | Q3 Mathematics | Pub Date: 2019-01-22 | DOI: 10.1515/sagmb-2018-0021
Xiaohong Li, Dongfeng Wu, Nigel G F Cooper, Shesh N Rai

High-throughput RNA sequencing (RNA-seq) technology is increasingly used in disease-related biomarker studies. The negative binomial distribution has become the popular choice for modeling read counts of genes in RNA-seq data because the read counts are over-dispersed. In this study, we propose two explicit sample size calculation methods for RNA-seq data using a negative binomial regression model. To derive these new sample size formulas, the common dispersion parameter and the size factor, entered as an offset via a natural logarithm link function, are incorporated. A two-sided Wald test statistic derived from the coefficient parameter is used for testing a single gene at a nominal significance level of 0.05 and multiple genes at a false discovery rate of 0.05. The variance for the Wald test is computed from the variance-covariance matrix, with the parameters estimated by maximum likelihood under the unrestricted and constrained scenarios. We evaluate the performance of our new formulas and compare them side by side with three existing methods, based on a Wald test, a likelihood ratio test or an exact test, via simulation studies. Since the other methods are much more computationally intensive, we recommend our M1 method for quick and direct estimation of sample sizes in an experimental design. Finally, we illustrate sample size estimation using an existing breast cancer RNA-seq data set.
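
To make the kind of calculation described here concrete, the sketch below uses a generic delta-method approximation for a two-sided Wald test of the log fold change between two equally sized groups under a negative binomial model with common dispersion. It is not the paper's M1 or M2 formula. It assumes a size factor of 1 (no offset), a single gene, and a fold change different from 1; the function name and example inputs are hypothetical.

```python
from math import ceil, log
from statistics import NormalDist

def nb_wald_sample_size(mu_control, fold_change, dispersion,
                        alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided Wald test.

    Assumes Var(count) = mu + dispersion * mu**2, so that by the delta
    method Var(log of a group mean estimate) is about
    (1/mu + dispersion) / n.
    """
    mu_treated = mu_control * fold_change
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Variance of the estimated log fold change, multiplied by n.
    var_unit = 1 / mu_control + 1 / mu_treated + 2 * dispersion
    return ceil(z_sum ** 2 * var_unit / log(fold_change) ** 2)

# Hypothetical inputs: mean count 10 in controls, two-fold change.
n_low = nb_wald_sample_size(10, 2.0, dispersion=0.1)
n_high = nb_wald_sample_size(10, 2.0, dispersion=0.5)
print(n_low, n_high)
```

Larger dispersion inflates the variance term and therefore the required sample size. For multiple genes, `alpha` would have to be replaced by a per-gene level chosen to meet the target false discovery rate, which this sketch does not attempt.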

Citations: 7

MLML2R: an R package for maximum likelihood estimation of DNA methylation and hydroxymethylation proportions.
IF 0.9 | Zone 4 (Mathematics) | Q3 Mathematics | Pub Date: 2019-01-17 | DOI: 10.1515/sagmb-2018-0031
Samara F Kiihl, Maria Jose Martinez-Garrido, Arce Domingo-Relloso, Jose Bermudez, Maria Tellez-Plaza

Accurately measuring epigenetic marks such as 5-methylcytosine (5-mC) and 5-hydroxymethylcytosine (5-hmC) at the single-nucleotide level requires combining data from DNA processing methods including traditional (BS), oxidative (oxBS) or Tet-Assisted (TAB) bisulfite conversion. We introduce the R package MLML2R, which provides maximum likelihood estimates (MLE) of 5-mC and 5-hmC proportions. While all other available R packages provide 5-mC and 5-hmC MLEs only for the oxBS+BS combination, MLML2R also provides MLEs for TAB combinations. For combinations of any two of the methods, we derived the pool-adjacent-violators algorithm (PAVA) exact constrained MLE in analytical form. For the combination of all three methods, we implemented both the iterative method by Qu et al. [Qu, J., M. Zhou, Q. Song, E. E. Hong and A. D. Smith (2013): "Mlml: consistent simultaneous estimates of dna methylation and hydroxymethylation," Bioinformatics, 29, 2645-2646.] and a novel non-iterative approximation using Lagrange multipliers. The newly proposed non-iterative solutions greatly decrease computational time, a common bottleneck when processing high-throughput data. The MLML2R package is flexible, as it takes as input both preprocessed intensities from Infinium Methylation arrays and counts from Next Generation Sequencing technologies. The MLML2R package is freely available at https://CRAN.R-project.org/package=MLML2R.
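
The two-method case can be sketched for a single CpG site with the oxBS+BS combination. The sketch assumes that BS reports 5-mC plus 5-hmC and oxBS reports 5-mC only, each as an independent binomial count. Under the constraint that the 5-hmC proportion is non-negative, the order-restricted MLE either keeps the two raw proportions or pools them, which is the PAVA step. This is a Python illustration of that idea, not the MLML2R interface; the function and argument names are hypothetical.

```python
def mle_oxbs_bs(meth_bs, unmeth_bs, meth_oxbs, unmeth_oxbs):
    """Constrained MLE of (5-mC, 5-hmC, unmodified C) proportions at one site.

    meth_* / unmeth_* are methylated and unmethylated signal counts from
    the BS and oxBS experiments (hypothetical argument names).
    """
    n_bs = meth_bs + unmeth_bs
    n_oxbs = meth_oxbs + unmeth_oxbs
    p_bs = meth_bs / n_bs        # estimates 5-mC + 5-hmC
    p_oxbs = meth_oxbs / n_oxbs  # estimates 5-mC alone
    if p_bs >= p_oxbs:
        # Constraint satisfied: the unconstrained estimates are the MLE.
        p_mc, p_hmc = p_oxbs, p_bs - p_oxbs
    else:
        # Constraint violated: pool both experiments (PAVA), 5-hmC is 0.
        p_mc = (meth_bs + meth_oxbs) / (n_bs + n_oxbs)
        p_hmc = 0.0
    return p_mc, p_hmc, 1.0 - p_mc - p_hmc

ordered = mle_oxbs_bs(80, 20, 60, 40)   # BS proportion above oxBS
violated = mle_oxbs_bs(50, 50, 60, 40)  # BS proportion below oxBS
print(ordered, violated)
```

Because the estimate is a closed-form expression per site, it can be evaluated over many sites without any iteration.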

Citations: 10
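The MLML2R abstract mentions a closed-form constrained MLE for two-method combinations. As a rough illustration only (not the package's code, which is written in R), the sketch below handles the BS+oxBS case under a simple binomial model: BS reads report 5-mC + 5-hmC, oxBS reads report 5-mC alone, and the constraint is that the 5-hmC proportion cannot be negative. The function name and argument names are hypothetical.

```python
def constrained_mle_bs_oxbs(meth_bs, total_bs, meth_ox, total_ox):
    """Return (p_5mC, p_5hmC) from BS and oxBS read counts.

    Assumed model (an illustration, not taken from the package):
      meth_bs ~ Binomial(total_bs, p_5mC + p_5hmC)
      meth_ox ~ Binomial(total_ox, p_5mC)
    subject to p_5hmC >= 0.
    """
    if total_bs <= 0 or total_ox <= 0:
        raise ValueError("coverage must be positive")

    p_bs = meth_bs / total_bs   # unconstrained estimate of p_5mC + p_5hmC
    p_ox = meth_ox / total_ox   # unconstrained estimate of p_5mC

    if p_ox <= p_bs:
        # Order constraint already holds: the naive estimates are the MLE.
        return p_ox, p_bs - p_ox

    # Violation: pool the two adjacent estimates (the PAVA step), weighting
    # by coverage. The pooled value estimates p_5mC and p_5hmC becomes 0.
    pooled = (meth_bs + meth_ox) / (total_bs + total_ox)
    return pooled, 0.0


if __name__ == "__main__":
    print(constrained_mle_bs_oxbs(80, 100, 50, 100))  # constraint satisfied
    print(constrained_mle_bs_oxbs(40, 100, 60, 100))  # constraint violated, pooled
```

With only two ordered proportions, the pooling step reduces to a single comparison, which is why no iteration is needed in this case.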
An Empirical Bayes approach for the identification of long-range chromosomal interaction from Hi-C data
IF 0.9 Zone 4 Mathematics Q3 Mathematics Pub Date: 2018-12-17 DOI: 10.1101/497776
Qi Zhang, Zheng Xu, Yutong Lai
Hi-C experiments have become very popular in recent years for studying 3D genome structure. Identification of long-range chromosomal interactions, i.e., peak detection, is crucial for Hi-C data analysis. However, it remains a challenging task due to the inherent high dimensionality, sparsity, and over-dispersion of the Hi-C count data matrix. We propose EBHiC, an empirical Bayes approach for peak detection from Hi-C data. The proposed framework provides flexible over-dispersion modeling by explicitly including the “true” interaction intensities as latent variables. To implement the proposed peak identification method (via the empirical Bayes test), we estimate the overall distribution of the observed counts semiparametrically using a Smoothed Expectation Maximization algorithm, and the empirical null based on the zero assumption. We conducted extensive simulations to validate and evaluate the performance of our proposed approach and applied it to real datasets. Our results suggest that EBHiC identifies better peaks in terms of accuracy, biological interpretability, and consistency across biological replicates. The source code is available on GitHub (https://github.com/QiZhangStat/EBHiC).
Citations: 0
False discovery control for penalized variable selections with high-dimensional covariates.
IF 0.9 Zone 4 Mathematics Q3 Mathematics Pub Date: 2018-12-15 DOI: 10.1515/sagmb-2018-0038
Kevin He, Xiang Zhou, Hui Jiang, Xiaoquan Wen, Yi Li

Modern biotechnologies have produced vast amounts of high-throughput data in which the number of predictors far exceeds the sample size. Penalized variable selection has emerged as a powerful and efficient dimension reduction tool. However, controlling false discoveries (i.e., the inclusion of irrelevant variables) in penalized high-dimensional variable selection presents serious challenges. To effectively control the fraction of false discoveries for penalized variable selections, we propose a false discovery controlling procedure. The proposed method is general and flexible, and can work with a broad class of variable selection algorithms, not only for linear regression, but also for generalized linear models and survival analysis.

Citations: 0