Gail Gong, Wei Wang, Chih-Lin Hsieh, David J Van Den Berg, Christopher Haiman, Ingrid Oakley-Girvan, Alice S Whittemore
Genome-wide sequencing enables evaluation of associations between traits and combinations of variants in genes and pathways. Such evaluation, however, requires multi-locus association tests with good power regardless of the variant and trait characteristics. Because analyzing families may yield more power than analyzing unrelated individuals, we also need multi-locus tests applicable to both related and unrelated individuals. Here we describe such tests, and we introduce SKAT-X, a new test statistic that uses genome-wide data obtained from related or unrelated subjects to optimize power for the specific data at hand. Simulations show that: a) SKAT-X performs well regardless of variant and trait characteristics; and b) for binary traits, analyzing affected relatives brings more power than analyzing unrelated individuals, consistent with previous findings for single-locus tests. We illustrate the methods by applying them to rare unclassified missense variants in the tumor suppressor gene BRCA2, using combined data from prostate cancer families and unrelated prostate cancer cases and controls in the Multi-ethnic Cohort (MEC). The methods are implemented in open-source code, available for public use as the R package GATARS (Genetic Association Tests for Arbitrarily Related Subjects) at https://gailg.github.io/gatars/.
Data-adaptive multi-locus association testing in subjects with arbitrary genealogical relationships. Statistical Applications in Genetics and Molecular Biology 18(3), published 2019-04-08. DOI: https://doi.org/10.1515/sagmb-2018-0030
Trishanta Padayachee, Tatsiana Khamiakova, Ziv Shkedy, Perttu Salo, Markus Perola, Tomasz Burzykowski
A way to enhance our understanding of the development and progression of complex diseases is to investigate the influence of cellular environments on gene co-expression (i.e. gene-pair correlations). Often, changes in gene co-expression are investigated across two or more biological conditions defined by categorizing a continuous covariate. However, the selection of arbitrary cut-off points may have an influence on the results of an analysis. To address this issue, we use a general linear model (GLM) for correlated data to study the relationship between gene-module co-expression and a covariate like metabolite concentration. The GLM specifies the gene-pair correlations as a function of the continuous covariate. The use of the GLM allows for investigating different (linear and non-linear) patterns of co-expression. Furthermore, the modeling approach offers a formal framework for testing hypotheses about possible patterns of co-expression. In our paper, a simulation study is used to assess the performance of the GLM. The performance is compared with that of a previously proposed GLM that utilizes categorized covariates. The versatility of the model is illustrated by using a real-life example. We discuss the theoretical issues related to the construction of the test statistics and the computational challenges related to fitting of the proposed model.
A multivariate linear model for investigating the association between gene-module co-expression and a continuous covariate. Statistical Applications in Genetics and Molecular Biology 18(2), published 2019-03-15. DOI: https://doi.org/10.1515/sagmb-2018-0008
Fabian Schmich, Jack Kuipers, Gunter Merdes, Niko Beerenwinkel
In the post-genomic era of big data in biology, computational approaches to integrate multiple heterogeneous data sets are becoming increasingly important. Despite the availability of large amounts of omics data, prioritising genes relevant for a specific functional pathway based on genetic screening experiments remains a challenging task. Here, we introduce netprioR, a probabilistic generative model for semi-supervised integrative prioritisation of hit genes. The model integrates multiple network data sets representing gene-gene similarities and prior knowledge about gene functions from the literature with gene-based covariates, such as phenotypes measured in genetic perturbation screens, for example by RNA interference or CRISPR/Cas9. We evaluate netprioR on simulated data and show that the model outperforms current state-of-the-art methods in many scenarios and is on par otherwise. In an application to real biological data, we integrate 22 network data sets, 1784 prior knowledge class labels and 3840 RNA interference phenotypes in order to prioritise novel regulators of Notch signalling in Drosophila melanogaster. The biological relevance of our predictions is evaluated using in silico and in vivo experiments. An efficient implementation of netprioR is available as an R package at http://bioconductor.org/packages/netprioR.
netprioR: a probabilistic model for integrative hit prioritisation of genetic screens. Statistical Applications in Genetics and Molecular Biology 18(3), published 2019-03-06. DOI: https://doi.org/10.1515/sagmb-2018-0033
Hsin-Hsiung Huang, Senthil Balaji Girimurugan
In recent years, alignment-free methods have been widely applied in comparing genome sequences, because they are computationally efficient and provide desirable phylogenetic analysis results. These methods have been successfully combined with hierarchical clustering methods for finding phylogenetic trees. However, it may not be suitable to apply these alignment-free methods directly to existing statistical classification methods, because an appropriate statistical classification theory for integrating with the alignment-free representation methods is still lacking. In this article, we propose a discriminant analysis method which uses the discrete wavelet packet transform to classify whole genome sequences. The proposed alignment-free representation statistics of features follow a joint normal distribution asymptotically. The data analysis results indicate that the proposed method provides satisfactory classification results in real time.
Discrete Wavelet Packet Transform Based Discriminant Analysis for Whole Genome Sequences. Statistical Applications in Genetics and Molecular Biology 18(2), published 2019-02-15. DOI: https://doi.org/10.1515/sagmb-2018-0045
Jiehuan Sun, Jose D Herazo-Maya, Jane-Ling Wang, Naftali Kaminski, Hongyu Zhao
Longitudinal genomics data and survival outcomes are common in biomedical studies, where the genomics data are often of high dimension. It is of great interest to select informative longitudinal biomarkers (e.g. genes) related to the survival outcome. In this paper, we develop a computationally efficient tool, LCox, for selecting informative biomarkers related to the survival outcome using the longitudinal genomics data. LCox has high power to detect different forms of dependence between the longitudinal biomarkers and the survival outcome. We show through extensive simulation studies that LCox has improved performance compared to existing methods. In addition, by applying LCox to a dataset of patients with idiopathic pulmonary fibrosis, we are able to identify biologically meaningful genes while all other methods fail to make any discovery. An R package to perform LCox is freely available at https://CRAN.R-project.org/package=LCox.
LCox: a tool for selecting genes related to survival outcomes using longitudinal gene expression data. Statistical Applications in Genetics and Molecular Biology 18(2), published 2019-02-13. DOI: https://doi.org/10.1515/sagmb-2017-0060
Tyler G Kinzy, Timothy K Starr, George C Tseng, Yen-Yi Ho
Methods for exploring genetic interactions have been developed in an attempt to move beyond single gene analyses. Because biological molecules frequently participate in different processes under various cellular conditions, investigating the changes in gene coexpression patterns under various biological conditions could reveal important regulatory mechanisms. One of the methods for capturing gene coexpression dynamics, named liquid association (LA), quantifies the relationship where the coexpression between two genes is modulated by a third "coordinator" gene. This LA measure offers a natural framework for studying gene coexpression changes and has been applied increasingly to study regulatory networks among genes. With a wealth of publicly available gene expression data, there is a need to develop a meta-analytic framework for LA analysis. In this paper, we incorporated mixed effects when modeling correlation to account for between-studies heterogeneity. For statistical inference about LA, we developed a Markov chain Monte Carlo (MCMC) estimation procedure through a Bayesian hierarchical framework. We evaluated the proposed methods in a set of simulations and illustrated their use in two collections of experimental data sets. The first data set combined 10 pancreatic ductal adenocarcinoma gene expression studies to determine the role of possible coordinator gene USP9X in the Hippo pathway. The second experimental data set consisted of 907 gene expression microarray Escherichia coli experiments from multiple studies publicly available through the Many Microbe Microarray Database website (http://m3d.bu.edu/) and examined genes that coexpress with serA in the presence of coordinator gene Lrp.
Meta-analytic framework for modeling genetic coexpression dynamics. Statistical Applications in Genetics and Molecular Biology 18(1), published 2019-02-09. DOI: https://doi.org/10.1515/sagmb-2017-0052
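As an illustration of the liquid association idea described in the abstract above, the following Python sketch estimates an LA score for one gene triplet from a single data set. It assumes the commonly used estimator in which each gene is converted to normal scores and standardized, and LA is the average of the three-way product. The function names are hypothetical, and the paper's Bayesian meta-analytic model is not reproduced here.

```python
from statistics import NormalDist, fmean, pstdev


def normal_scores(values):
    """Replace each value by the standard-normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    nd = NormalDist()
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) stays strictly inside (0, 1)
        scores[i] = nd.inv_cdf(rank / (n + 1))
    return scores


def standardize(values):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    mu, sd = fmean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def liquid_association(x, y, z):
    """Sample LA score: mean of x*y*z after normal-score standardization.

    x, y are the expression profiles of the gene pair and z is the
    candidate coordinator gene (assumed roles, for illustration only).
    """
    xs = standardize(normal_scores(x))
    ys = standardize(normal_scores(y))
    zs = standardize(normal_scores(z))
    return fmean([a * b * c for a, b, c in zip(xs, ys, zs)])
```

A positive score suggests that the coexpression of the two genes increases with the coordinator gene, while a score near zero suggests no such modulation.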
Yashita Jain, Shanshan Ding, Jing Qiu
Advances in next-generation sequencing, transcriptomics, proteomics and other high-throughput technologies have enabled simultaneous measurement of multiple types of genomic data for cancer samples. Analyzed together, these data may reveal biological insights that a single type of genomic data cannot. This study proposes a novel use of a supervised dimension reduction method, sliced inverse regression, in multi-omics data analysis to improve prediction over single-data-type analysis. The study further proposes an integrative sliced inverse regression method (integrative SIR) for simultaneous analysis of multiple omics data types of cancer samples, including miRNA, mRNA and proteomics, to achieve integrative dimension reduction and to further improve prediction performance. Numerical results show that integrative analysis of multi-omics data is beneficial compared with single-data-source analysis and, more importantly, that supervised dimension reduction methods have advantages over unsupervised dimension reduction methods in integrative data analysis in terms of classification and prediction.
Sliced inverse regression for integrative multi-omics data analysis. Statistical Applications in Genetics and Molecular Biology 18(1), published 2019-01-26. DOI: https://doi.org/10.1515/sagmb-2018-0028
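The classical sliced inverse regression algorithm that the abstract above builds on can be sketched in a few lines of Python. This is a minimal version for a single data matrix with more samples than features; the function name and defaults are assumptions made for illustration, and the paper's integrative SIR for multiple omics types is not reproduced here.

```python
import numpy as np


def sliced_inverse_regression(X, y, n_slices=10, n_directions=1):
    """Estimate dimension-reduction directions by classical SIR.

    X: (n, p) predictor matrix with n > p >= 2; y: (n,) response.
    Returns a (p, n_directions) array of unit-length directions.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Whiten the predictors: Z has (approximately) identity covariance.
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    Z = (X - X.mean(axis=0)) @ inv_sqrt

    # Slice the samples by the sorted response and average Z per slice.
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)

    # Leading eigenvectors of the between-slice covariance of the means.
    _, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :n_directions]

    # Map back to the original predictor scale and normalize.
    directions = inv_sqrt @ top
    return directions / np.linalg.norm(directions, axis=0)
```

With high-dimensional omics data the sample covariance is typically singular, so some form of regularization or preliminary screening would be needed before a sketch like this applies.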
Yuan Xue, Jinjuan Wang, Juan Ding, Sanguo Zhang, Qizhai Li
Response-selective sampling designs are commonly adopted in genetic epidemiologic studies because, compared with prospective designs, they can substantially reduce time cost and increase the power to identify deleterious genetic variants that predispose to complex human diseases. The proportional odds model (POM) can be used to fit data obtained under this design. Unlike in the logistic regression model, however, the genetic effect estimated from the POM by treating the data as if they had been collected prospectively is inconsistent, so the power of the resulting Wald test is unsatisfactory. The modified POM is suitable for this type of data, but the corresponding Wald test is not optimal when the genetic effect is small. Here, we propose a new association test to handle this issue. Simulation studies show that the proposed test controls the type I error rate correctly and is more powerful than two existing methods. Finally, we apply the three tests to anti-cyclic citrullinated protein antibody data from Genetic Analysis Workshop 16.
A powerful test for ordinal trait genetic association analysis. Statistical Applications in Genetics and Molecular Biology 18(2), published 2019-01-26. DOI: https://doi.org/10.1515/sagmb-2017-0066
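For readers unfamiliar with the proportional odds model mentioned in the abstract above, its standard form for an ordinal trait with ordered categories can be written as follows. The symbols $Y$, $g$, $\alpha_j$, $\beta$ and $J$ are notation assumed here for illustration, and the sign convention differs between texts; the paper's modified POM and its proposed test are not reproduced.

```latex
\[
\operatorname{logit} P(Y \le j \mid g)
  = \log \frac{P(Y \le j \mid g)}{P(Y > j \mid g)}
  = \alpha_j + \beta g,
\qquad j = 1, \dots, J - 1,
\]
\[
\alpha_1 < \alpha_2 < \dots < \alpha_{J-1},
\qquad H_0 : \beta = 0 .
\]
```

Here $g$ is the genotype score at the tested variant, and the single slope $\beta$ shared by all $J-1$ cumulative logits is what makes the odds proportional.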
Xiaohong Li, Dongfeng Wu, Nigel G F Cooper, Shesh N Rai
High-throughput RNA sequencing (RNA-seq) technology is increasingly used in disease-related biomarker studies. The negative binomial distribution has become the popular choice for modeling read counts of genes in RNA-seq data because read counts are over-dispersed. In this study, we propose two explicit sample size calculation methods for RNA-seq data using a negative binomial regression model. To derive these new sample size formulas, the common dispersion parameter and the size factor, entering as an offset via a natural logarithm link function, are incorporated. A two-sided Wald test statistic derived from the coefficient parameter is used for testing a single gene at a nominal significance level of 0.05 and multiple genes at a false discovery rate of 0.05. The variance for the Wald test is computed from the variance-covariance matrix, with the parameters set to their maximum likelihood estimates under the unrestricted and constrained scenarios. The performance of our new formulas is evaluated via simulation studies, including a side-by-side comparison with three existing methods based on a Wald test, a likelihood ratio test or an exact test. Because the other methods are much more computationally intensive, we recommend our M1 method for quick and direct estimation of sample sizes in an experimental design. Finally, we illustrate sample size estimation using an existing breast cancer RNA-seq dataset.
Sample size calculations for the differential expression analysis of RNA-seq data using a negative binomial regression model. Statistical Applications in Genetics and Molecular Biology 18(1), published 2019-01-22. DOI: https://doi.org/10.1515/sagmb-2018-0021
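The modeling ingredients named in the abstract above (negative binomial counts, a common dispersion, a size-factor offset under a log link, and a Wald test on the group coefficient) can be written out for a single gene as follows. This is a sketch in one common parameterization; the symbols $Y_j$, $s_j$, $x_j$, $\phi$, $\beta_0$ and $\beta_1$ are notation assumed here, and the paper's sample size formulas themselves are not reproduced.

```latex
\[
Y_j \sim \mathrm{NB}(\mu_j, \phi),
\qquad
\operatorname{Var}(Y_j) = \mu_j + \phi\,\mu_j^{2},
\]
\[
\log \mu_j = \log s_j + \beta_0 + \beta_1 x_j,
\qquad
W = \frac{\hat{\beta}_1}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta}_1)}} .
\]
```

Here $Y_j$ is the read count in sample $j$, $s_j$ its size factor, and $x_j \in \{0, 1\}$ the group indicator, so that $\beta_1$ is the log fold change tested by the two-sided Wald statistic $W$.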
Samara F Kiihl, Maria Jose Martinez-Garrido, Arce Domingo-Relloso, Jose Bermudez, Maria Tellez-Plaza
Accurately measuring epigenetic marks such as 5-methylcytosine (5-mC) and 5-hydroxymethylcytosine (5-hmC) at the single-nucleotide level requires combining data from DNA processing methods including traditional (BS), oxidative (oxBS) or Tet-Assisted (TAB) bisulfite conversion. We introduce the R package MLML2R, which provides maximum likelihood estimates (MLE) of 5-mC and 5-hmC proportions. While all other available R packages provide 5-mC and 5-hmC MLEs only for the oxBS+BS combination, MLML2R also provides MLEs for combinations involving TAB. For combinations of any two of the methods, we derived the pool-adjacent-violators algorithm (PAVA) exact constrained MLE in analytical form. For the combination of all three methods, we implemented both the iterative method of Qu et al. [Qu, J., M. Zhou, Q. Song, E. E. Hong and A. D. Smith (2013): "MLML: consistent simultaneous estimates of DNA methylation and hydroxymethylation," Bioinformatics, 29, 2645-2646.] and a novel non-iterative approximation using Lagrange multipliers. The newly proposed non-iterative solutions greatly decrease computational time, a common bottleneck when processing high-throughput data. The MLML2R package is flexible, as it takes as input both preprocessed intensities from Infinium Methylation arrays and counts from Next Generation Sequencing technologies. The MLML2R package is freely available at https://CRAN.R-project.org/package=MLML2R.
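To make the two-method case concrete, the following is a minimal Python sketch of a constrained MLE for the oxBS+BS combination. It is not the MLML2R code. It assumes a simple binomial model in which the BS methylated signal reflects 5-mC plus 5-hmC and the oxBS methylated signal reflects 5-mC only. The function name `mle_bs_oxbs` and its argument names are hypothetical and chosen for illustration. When the unconstrained estimate of 5-hmC would be negative, the two methylation rates are pooled, which is the pool-adjacent-violators step for two groups.

```python
def mle_bs_oxbs(meth_bs, unmeth_bs, meth_oxbs, unmeth_oxbs):
    """Constrained MLE of (5-mC, 5-hmC, unmodified C) proportions at one site.

    Assumed model (an illustration, not taken from the package):
      meth_bs   ~ Binomial(meth_bs + unmeth_bs,     p_mc + p_hmc)
      meth_oxbs ~ Binomial(meth_oxbs + unmeth_oxbs, p_mc)
    subject to p_hmc >= 0.
    """
    n_bs = meth_bs + unmeth_bs
    n_ox = meth_oxbs + unmeth_oxbs
    if n_bs <= 0 or n_ox <= 0:
        raise ValueError("each method needs a positive total signal")

    rate_bs = meth_bs / n_bs    # estimates p_mc + p_hmc
    rate_ox = meth_oxbs / n_ox  # estimates p_mc

    if rate_ox <= rate_bs:
        # Order constraint already satisfied: unconstrained MLE is valid.
        p_mc = rate_ox
        p_hmc = rate_bs - rate_ox
    else:
        # Violation: pool both experiments into one rate and set 5-hmC to 0.
        p_mc = (meth_bs + meth_oxbs) / (n_bs + n_ox)
        p_hmc = 0.0

    p_c = 1.0 - p_mc - p_hmc
    return p_mc, p_hmc, p_c


# Example calls with made-up counts (not data from the paper).
ordered = mle_bs_oxbs(80, 20, 60, 40)   # oxBS rate below BS rate
violated = mle_bs_oxbs(50, 50, 70, 30)  # oxBS rate above BS rate, so pooled
```

The closed form avoids any iteration per site, which is the kind of saving the abstract attributes to analytical solutions. The three-method case described in the abstract is not covered by this sketch.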
{"title":"MLML2R: an R package for maximum likelihood estimation of DNA methylation and hydroxymethylation proportions.","authors":"Samara F Kiihl, Maria Jose Martinez-Garrido, Arce Domingo-Relloso, Jose Bermudez, Maria Tellez-Plaza","doi":"10.1515/sagmb-2018-0031","DOIUrl":"https://doi.org/10.1515/sagmb-2018-0031","url":null,"abstract":"<p><p>Accurately measuring epigenetic marks such as 5-methylcytosine (5-mC) and 5-hydroxymethylcytosine (5-hmC) at the single-nucleotide level requires combining data from DNA processing methods including traditional (BS), oxidative (oxBS) or Tet-Assisted (TAB) bisulfite conversion. We introduce the R package MLML2R, which provides maximum likelihood estimates (MLE) of 5-mC and 5-hmC proportions. While all other available R packages provide 5-mC and 5-hmC MLEs only for the oxBS+BS combination, MLML2R also provides MLEs for combinations involving TAB. For combinations of any two of the methods, we derived the pool-adjacent-violators algorithm (PAVA) exact constrained MLE in analytical form. For the combination of all three methods, we implemented both the iterative method of Qu et al. [Qu, J., M. Zhou, Q. Song, E. E. Hong and A. D. Smith (2013): \"MLML: consistent simultaneous estimates of DNA methylation and hydroxymethylation,\" Bioinformatics, 29, 2645-2646.] and a novel non-iterative approximation using Lagrange multipliers. The newly proposed non-iterative solutions greatly decrease computational time, a common bottleneck when processing high-throughput data. The MLML2R package is flexible, as it takes as input both preprocessed intensities from Infinium Methylation arrays and counts from Next Generation Sequencing technologies. 
The MLML2R package is freely available at https://CRAN.R-project.org/package=MLML2R.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"18 1","pages":""},"PeriodicalIF":0.9,"publicationDate":"2019-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2018-0031","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36872982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}