Abstract Among the large of number of statistical methods that have been proposed to identify gene-gene interactions in case-control genome-wide association studies (GWAS), gene-based methods have recently grown in popularity as they confer advantage in both statistical power and biological interpretation. All of the gene-based methods jointly model the distribution of single nucleotide polymorphisms (SNPs) sets prior to the statistical test, leading to a limited power to detect sums of SNP-SNP signals. In this paper, we instead propose a gene-based method that first performs SNP-SNP interaction tests before aggregating the obtained p-values into a test at the gene level. Our method called AGGrEGATOr is based on a minP procedure that tests the significance of the minimum of a set of p-values. We use simulations to assess the capacity of AGGrEGATOr to correctly control for type-I error. The benefits of our approach in terms of statistical power and robustness to SNPs set characteristics are evaluated in a wide range of disease models by comparing it to previous methods. We also apply our method to detect gene pairs associated to rheumatoid arthritis (RA) on the GSE39428 dataset. We identify 13 potential gene-gene interactions and replicate one gene pair in the Wellcome Trust Case Control Consortium dataset at the level of 5%. We further test 15 gene pairs, previously reported as being statistically associated with RA or Crohn’s disease (CD) or coronary artery disease (CAD), for replication in the Wellcome Trust Case Control Consortium dataset. We show that AGGrEGATOr is the only method able to successfully replicate seven gene pairs.
{"title":"AGGrEGATOr: A Gene-based GEne-Gene interActTiOn test for case-control association studies","authors":"M. Emily","doi":"10.1515/sagmb-2015-0074","DOIUrl":"https://doi.org/10.1515/sagmb-2015-0074","url":null,"abstract":"Abstract Among the large of number of statistical methods that have been proposed to identify gene-gene interactions in case-control genome-wide association studies (GWAS), gene-based methods have recently grown in popularity as they confer advantage in both statistical power and biological interpretation. All of the gene-based methods jointly model the distribution of single nucleotide polymorphisms (SNPs) sets prior to the statistical test, leading to a limited power to detect sums of SNP-SNP signals. In this paper, we instead propose a gene-based method that first performs SNP-SNP interaction tests before aggregating the obtained p-values into a test at the gene level. Our method called AGGrEGATOr is based on a minP procedure that tests the significance of the minimum of a set of p-values. We use simulations to assess the capacity of AGGrEGATOr to correctly control for type-I error. The benefits of our approach in terms of statistical power and robustness to SNPs set characteristics are evaluated in a wide range of disease models by comparing it to previous methods. We also apply our method to detect gene pairs associated to rheumatoid arthritis (RA) on the GSE39428 dataset. We identify 13 potential gene-gene interactions and replicate one gene pair in the Wellcome Trust Case Control Consortium dataset at the level of 5%. We further test 15 gene pairs, previously reported as being statistically associated with RA or Crohn’s disease (CD) or coronary artery disease (CAD), for replication in the Wellcome Trust Case Control Consortium dataset. We show that AGGrEGATOr is the only method able to successfully replicate seven gene pairs.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2016-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2015-0074","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67003013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shiqi Cui, Tieming Ji, Jilong Li, J. Cheng, Jing Qiu
Abstract Identifying differentially expressed (DE) genes between different conditions is one of the main goals of RNA-seq data analysis. Although a large amount of RNA-seq data were produced for two-group comparison with small sample sizes at early stage, more and more RNA-seq data are being produced in the setting of complex experimental designs such as split-plot designs and repeated measure designs. Data arising from such experiments are traditionally analyzed by mixed-effects models. Therefore an appropriate statistical approach for analyzing RNA-seq data from such designs should be generalized linear mixed models (GLMM) or similar approaches that allow for random effects. However, common practices for analyzing such data in literature either treat random effects as fixed or completely ignore the experimental design and focus on two-group comparison using partial data. In this paper, we examine the effect of ignoring the random effects when analyzing RNA-seq data. We accomplish this goal by comparing the standard GLMM model to the methods that ignore the random effects through simulation studies and real data analysis. Our studies show that, ignoring random effects in a multi-factor experiment can lead to the increase of the false positives among the top selected genes or lower power when the nominal FDR level is controlled.
{"title":"What if we ignore the random effects when analyzing RNA-seq data in a multifactor experiment","authors":"Shiqi Cui, Tieming Ji, Jilong Li, J. Cheng, Jing Qiu","doi":"10.1515/sagmb-2015-0011","DOIUrl":"https://doi.org/10.1515/sagmb-2015-0011","url":null,"abstract":"Abstract Identifying differentially expressed (DE) genes between different conditions is one of the main goals of RNA-seq data analysis. Although a large amount of RNA-seq data were produced for two-group comparison with small sample sizes at early stage, more and more RNA-seq data are being produced in the setting of complex experimental designs such as split-plot designs and repeated measure designs. Data arising from such experiments are traditionally analyzed by mixed-effects models. Therefore an appropriate statistical approach for analyzing RNA-seq data from such designs should be generalized linear mixed models (GLMM) or similar approaches that allow for random effects. However, common practices for analyzing such data in literature either treat random effects as fixed or completely ignore the experimental design and focus on two-group comparison using partial data. In this paper, we examine the effect of ignoring the random effects when analyzing RNA-seq data. We accomplish this goal by comparing the standard GLMM model to the methods that ignore the random effects through simulation studies and real data analysis. Our studies show that, ignoring random effects in a multi-factor experiment can lead to the increase of the false positives among the top selected genes or lower power when the nominal FDR level is controlled.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2016-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2015-0011","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67002901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract We are presenting a comprehensive comparative analysis of five differential methylation (DM) identification methods: methylKit, BSmooth, BiSeq, HMM-DM, and HMM-Fisher, which are developed for bisulfite sequencing (BS) data. We summarize the features of these methods from several analytical aspects and compare their performances using both simulated and real BS datasets. Our comparison results are summarized below. First, parameter settings may largely affect the accuracy of DM identification. Different from default settings, modified parameter settings yield higher sensitivity and/or lower false positive rates. Second, all five methods show more accurate results when identifying simulated DM regions that are long and have small within-group variation, but they have low concordance, probably due to the different approaches they have used for DM identification. Third, HMM-DM and HMM-Fisher yield relatively higher sensitivity and lower false positive rates than others, especially in DM regions with large variation. Finally, we have found that among the three methods that involve methylation estimation (methylKit, BSmooth, and BiSeq), BiSeq can best present raw methylation signals. Therefore, based on these results, we suggest that users select DM identification methods based on the characteristics of their data and the advantages of each method.
{"title":"Comparing five statistical methods of differential methylation identification using bisulfite sequencing data","authors":"Xiaoqing Yu, Shuying Sun","doi":"10.1515/sagmb-2015-0078","DOIUrl":"https://doi.org/10.1515/sagmb-2015-0078","url":null,"abstract":"Abstract We are presenting a comprehensive comparative analysis of five differential methylation (DM) identification methods: methylKit, BSmooth, BiSeq, HMM-DM, and HMM-Fisher, which are developed for bisulfite sequencing (BS) data. We summarize the features of these methods from several analytical aspects and compare their performances using both simulated and real BS datasets. Our comparison results are summarized below. First, parameter settings may largely affect the accuracy of DM identification. Different from default settings, modified parameter settings yield higher sensitivity and/or lower false positive rates. Second, all five methods show more accurate results when identifying simulated DM regions that are long and have small within-group variation, but they have low concordance, probably due to the different approaches they have used for DM identification. Third, HMM-DM and HMM-Fisher yield relatively higher sensitivity and lower false positive rates than others, especially in DM regions with large variation. Finally, we have found that among the three methods that involve methylation estimation (methylKit, BSmooth, and BiSeq), BiSeq can best present raw methylation signals. Therefore, based on these results, we suggest that users select DM identification methods based on the characteristics of their data and the advantages of each method.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2016-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2015-0078","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67003069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract Graphical models have proven to be a valuable tool for connecting genotypes and phenotypes. Structural learning of phenotype-genotype networks has received considerable attention in the post-genome era. In recent years, a dozen different methods have emerged for network inference, which leverage natural variation that arises in certain genetic populations. The structure of the network itself can be used to form hypotheses based on the inferred direct and indirect network relationships, but represents a premature endpoint to the graphical analyses. In this work, we extend this endpoint. We examine the unexplored problem of perturbing a given network structure, and quantifying the system-wide effects on the network in a node-wise manner. The perturbation is achieved through the setting of values of phenotype node(s), which may reflect an inhibition or activation, and propagating this information through the entire network. We leverage belief propagation methods in Conditional Gaussian Bayesian Networks (CG-BNs), in order to absorb and propagate phenotypic evidence through the network. We show that the modeling assumptions adopted for genotype-phenotype networks represent an important sub-class of CG-BNs, which possess properties that ensure exact inference in the propagation scheme. The system-wide effects of the perturbation are quantified in a node-wise manner through the comparison of perturbed and unperturbed marginal distributions using a symmetric Kullback-Leibler divergence. Applications to kidney and skin cancer expression quantitative trait loci (eQTL) data from different mus musculus populations are presented. System-wide effects in the network were predicted and visualized across a spectrum of evidence. Sub-pathways and regions of the network responded in concert, suggesting co-regulation and coordination throughout the network in response to phenotypic changes. We demonstrate how these predicted system-wide effects can be examined in connection with estimated class probabilities for covariates of interest, e.g. cancer status. Despite the uncertainty in the network structure, we demonstrate the system-wide predictions are stable across an ensemble of highly likely networks. A software package, geneNetBP, which implements our approach, was developed in the R programming language.
{"title":"Belief propagation in genotype-phenotype networks","authors":"Janhavi Moharil, Paul May, D. Gaile, R. Blair","doi":"10.1515/sagmb-2015-0058","DOIUrl":"https://doi.org/10.1515/sagmb-2015-0058","url":null,"abstract":"Abstract Graphical models have proven to be a valuable tool for connecting genotypes and phenotypes. Structural learning of phenotype-genotype networks has received considerable attention in the post-genome era. In recent years, a dozen different methods have emerged for network inference, which leverage natural variation that arises in certain genetic populations. The structure of the network itself can be used to form hypotheses based on the inferred direct and indirect network relationships, but represents a premature endpoint to the graphical analyses. In this work, we extend this endpoint. We examine the unexplored problem of perturbing a given network structure, and quantifying the system-wide effects on the network in a node-wise manner. The perturbation is achieved through the setting of values of phenotype node(s), which may reflect an inhibition or activation, and propagating this information through the entire network. We leverage belief propagation methods in Conditional Gaussian Bayesian Networks (CG-BNs), in order to absorb and propagate phenotypic evidence through the network. We show that the modeling assumptions adopted for genotype-phenotype networks represent an important sub-class of CG-BNs, which possess properties that ensure exact inference in the propagation scheme. The system-wide effects of the perturbation are quantified in a node-wise manner through the comparison of perturbed and unperturbed marginal distributions using a symmetric Kullback-Leibler divergence. Applications to kidney and skin cancer expression quantitative trait loci (eQTL) data from different mus musculus populations are presented. System-wide effects in the network were predicted and visualized across a spectrum of evidence. Sub-pathways and regions of the network responded in concert, suggesting co-regulation and coordination throughout the network in response to phenotypic changes. We demonstrate how these predicted system-wide effects can be examined in connection with estimated class probabilities for covariates of interest, e.g. cancer status. Despite the uncertainty in the network structure, we demonstrate the system-wide predictions are stable across an ensemble of highly likely networks. A software package, geneNetBP, which implements our approach, was developed in the R programming language.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2016-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2015-0058","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67002956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Miecznikowski, D. Gaile, Xiwei Chen, D. Tritchler
Abstract It is often of scientific interest to find a set of genes that may represent an independent functional module or network, such as a functional gene expression module causing a biological response, a transcription regulatory network, or a constellation of mutations jointly causing a disease. In this paper we are specifically interested in identifying modules that control a particular outcome variable such as a disease biomarker. We discuss the statistical properties that functional networks should possess and introduce the concept of network consistency which should be satisfied by real functional networks of cooperating genes, and directly use the concept in the pathway discovery method we present. Our method gives superior performance for all but the simplest functional networks.
{"title":"Identification of consistent functional genetic modules","authors":"J. Miecznikowski, D. Gaile, Xiwei Chen, D. Tritchler","doi":"10.1515/sagmb-2015-0026","DOIUrl":"https://doi.org/10.1515/sagmb-2015-0026","url":null,"abstract":"Abstract It is often of scientific interest to find a set of genes that may represent an independent functional module or network, such as a functional gene expression module causing a biological response, a transcription regulatory network, or a constellation of mutations jointly causing a disease. In this paper we are specifically interested in identifying modules that control a particular outcome variable such as a disease biomarker. We discuss the statistical properties that functional networks should possess and introduce the concept of network consistency which should be satisfied by real functional networks of cooperating genes, and directly use the concept in the pathway discovery method we present. Our method gives superior performance for all but the simplest functional networks.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2016-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2015-0026","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67002907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract DNA methylation is an epigenetic modification involved in organism development and cellular differentiation. Identifying differential methylations can help to study genomic regions associated with diseases. Differential methylation studies on single-CG resolution have become possible with the bisulfite sequencing (BS) technology. However, there is still a lack of efficient statistical methods for identifying differentially methylated (DM) regions in BS data. We have developed a new approach named HMM-DM to detect DM regions between two biological conditions using BS data. This new approach first uses a hidden Markov model (HMM) to identify DM CG sites accounting for spatial correlation across CG sites and variation across samples, and then summarizes identified sites into regions. We demonstrate through a simulation study that our approach has a superior performance compared to BSmooth. We also illustrate the application of HMM-DM using a real breast cancer dataset.
{"title":"HMM-DM: identifying differentially methylated regions using a hidden Markov model","authors":"Xiaoqing Yu, Shuying Sun","doi":"10.1515/sagmb-2015-0077","DOIUrl":"https://doi.org/10.1515/sagmb-2015-0077","url":null,"abstract":"Abstract DNA methylation is an epigenetic modification involved in organism development and cellular differentiation. Identifying differential methylations can help to study genomic regions associated with diseases. Differential methylation studies on single-CG resolution have become possible with the bisulfite sequencing (BS) technology. However, there is still a lack of efficient statistical methods for identifying differentially methylated (DM) regions in BS data. We have developed a new approach named HMM-DM to detect DM regions between two biological conditions using BS data. This new approach first uses a hidden Markov model (HMM) to identify DM CG sites accounting for spatial correlation across CG sites and variation across samples, and then summarizes identified sites into regions. We demonstrate through a simulation study that our approach has a superior performance compared to BSmooth. We also illustrate the application of HMM-DM using a real breast cancer dataset.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2016-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2015-0077","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67003063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract DNA methylation is an epigenetic event that plays an important role in regulating gene expression. It is important to study DNA methylation, especially differential methylation patterns between two groups of samples (e.g. patients vs. normal individuals). With next generation sequencing technologies, it is now possible to identify differential methylation patterns by considering methylation at the single CG site level in an entire genome. However, it is challenging to analyze large and complex NGS data. In order to address this difficult question, we have developed a new statistical method using a hidden Markov model and Fisher’s exact test (HMM-Fisher) to identify differentially methylated cytosines and regions. We first use a hidden Markov chain to model the methylation signals to infer the methylation state as Not methylated (N), Partly methylated (P), and Fully methylated (F) for each individual sample. We then use Fisher’s exact test to identify differentially methylated CG sites. We show the HMM-Fisher method and compare it with commonly cited methods using both simulated data and real sequencing data. The results show that HMM-Fisher outperforms the current available methods to which we have compared. HMM-Fisher is efficient and robust in identifying heterogeneous DM regions.
{"title":"HMM-Fisher: identifying differential methylation using a hidden Markov model and Fisher’s exact test","authors":"Shuying Sun, Xiaoqing Yu","doi":"10.1515/sagmb-2015-0076","DOIUrl":"https://doi.org/10.1515/sagmb-2015-0076","url":null,"abstract":"Abstract DNA methylation is an epigenetic event that plays an important role in regulating gene expression. It is important to study DNA methylation, especially differential methylation patterns between two groups of samples (e.g. patients vs. normal individuals). With next generation sequencing technologies, it is now possible to identify differential methylation patterns by considering methylation at the single CG site level in an entire genome. However, it is challenging to analyze large and complex NGS data. In order to address this difficult question, we have developed a new statistical method using a hidden Markov model and Fisher’s exact test (HMM-Fisher) to identify differentially methylated cytosines and regions. We first use a hidden Markov chain to model the methylation signals to infer the methylation state as Not methylated (N), Partly methylated (P), and Fully methylated (F) for each individual sample. We then use Fisher’s exact test to identify differentially methylated CG sites. We show the HMM-Fisher method and compare it with commonly cited methods using both simulated data and real sequencing data. The results show that HMM-Fisher outperforms the current available methods to which we have compared. HMM-Fisher is efficient and robust in identifying heterogeneous DM regions.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2016-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2015-0076","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67003025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Samuel A. Mason, Faiz Sayyid, Paul D. W. Kirk, Colin Starr, D. Wild
Abstract The integration of multi-dimensional datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct – but often complementary – information. However, the large amount of data adds burden to any inference task. Flexible Bayesian methods may reduce the necessity for strong modelling assumptions, but can also increase the computational burden. We present an improved implementation of a Bayesian correlated clustering algorithm, that permits integrated clustering to be routinely performed across multiple datasets, each with tens of thousands of items. By exploiting GPU based computation, we are able to improve runtime performance of the algorithm by almost four orders of magnitude. This permits analysis across genomic-scale data sets, greatly expanding the range of applications over those originally possible. MDI is available here: http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/.
{"title":"MDI-GPU: accelerating integrative modelling for genomic-scale data using GP-GPU computing","authors":"Samuel A. Mason, Faiz Sayyid, Paul D. W. Kirk, Colin Starr, D. Wild","doi":"10.1515/sagmb-2015-0055","DOIUrl":"https://doi.org/10.1515/sagmb-2015-0055","url":null,"abstract":"Abstract The integration of multi-dimensional datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct – but often complementary – information. However, the large amount of data adds burden to any inference task. Flexible Bayesian methods may reduce the necessity for strong modelling assumptions, but can also increase the computational burden. We present an improved implementation of a Bayesian correlated clustering algorithm, that permits integrated clustering to be routinely performed across multiple datasets, each with tens of thousands of items. By exploiting GPU based computation, we are able to improve runtime performance of the algorithm by almost four orders of magnitude. This permits analysis across genomic-scale data sets, greatly expanding the range of applications over those originally possible. MDI is available here: http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2016-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2015-0055","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67002921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Gelfond, J. Ibrahim, Ming-Hui Chen, Wei Sun, Kaitlyn N. Lewis, Sean Kinahan, Matthew A. Hibbs, R. Buffenstein
Abstract There is an increasing demand for exploration of the transcriptomes of multiple species with extraordinary traits such as the naked-mole rat (NMR). The NMR is remarkable because of its longevity and resistance to developing cancer. It is of scientific interest to understand the molecular mechanisms that impart these traits, and RNA-sequencing experiments with comparator species can correlate transcriptome dynamics with these phenotypes. Comparing transcriptome differences requires a homology mapping of each transcript in one species to transcript(s) within the other. Such mappings are necessary, especially if one species does not have well-annotated genome available. Current approaches for this type of analysis typically identify the best match for each transcript, but the best match analysis ignores the inherent risks of mismatch when there are multiple candidate transcripts with similar homology scores. We present a method that treats the set of homologs from a novel species as a cluster corresponding to a single gene in the reference species, and we compare the cluster-based approach to a conventional best-match analysis in both simulated data and a case study with NMR and mouse tissues. We demonstrate that the cluster-based approach has superior power to detect differential expression.
{"title":"Homology cluster differential expression analysis for interspecies mRNA-Seq experiments","authors":"J. Gelfond, J. Ibrahim, Ming-Hui Chen, Wei Sun, Kaitlyn N. Lewis, Sean Kinahan, Matthew A. Hibbs, R. Buffenstein","doi":"10.1515/sagmb-2014-0056","DOIUrl":"https://doi.org/10.1515/sagmb-2014-0056","url":null,"abstract":"Abstract There is an increasing demand for exploration of the transcriptomes of multiple species with extraordinary traits such as the naked-mole rat (NMR). The NMR is remarkable because of its longevity and resistance to developing cancer. It is of scientific interest to understand the molecular mechanisms that impart these traits, and RNA-sequencing experiments with comparator species can correlate transcriptome dynamics with these phenotypes. Comparing transcriptome differences requires a homology mapping of each transcript in one species to transcript(s) within the other. Such mappings are necessary, especially if one species does not have well-annotated genome available. Current approaches for this type of analysis typically identify the best match for each transcript, but the best match analysis ignores the inherent risks of mismatch when there are multiple candidate transcripts with similar homology scores. We present a method that treats the set of homologs from a novel species as a cluster corresponding to a single gene in the reference species, and we compare the cluster-based approach to a conventional best-match analysis in both simulated data and a case study with NMR and mouse tissues. We demonstrate that the cluster-based approach has superior power to detect differential expression.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2014-0056","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67002383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract Splitting extended families into their component nuclear families to apply a genetic association method designed for nuclear families is a widespread practice in familial genetic studies. Dependence among genotypes and phenotypes of nuclear families from the same extended family arises because of genetic linkage of the tested marker with a risk variant or because of familial specificity of genetic effects due to gene-environment interaction. This raises concerns about the validity of inference conducted under the assumption of independence of the nuclear families. We indeed prove theoretically that, in a conditional logistic regression analysis applicable to disease cases and their genotyped parents, the naive model-based estimator of the variance of the coefficient estimates underestimates the true variance. However, simulations with realistic effect sizes of risk variants and variation of this effect from family to family reveal that the underestimation is negligible. The simulations also show the greater efficiency of the model-based variance estimator compared to a robust empirical estimator. Our recommendation is therefore, to use the model-based estimator of variance for inference on effects of genetic variants.
{"title":"On the validity of within-nuclear-family genetic association analysis in samples of extended families","authors":"A. Bureau, T. Duchesne","doi":"10.1515/sagmb-2015-0056","DOIUrl":"https://doi.org/10.1515/sagmb-2015-0056","url":null,"abstract":"Abstract Splitting extended families into their component nuclear families to apply a genetic association method designed for nuclear families is a widespread practice in familial genetic studies. Dependence among genotypes and phenotypes of nuclear families from the same extended family arises because of genetic linkage of the tested marker with a risk variant or because of familial specificity of genetic effects due to gene-environment interaction. This raises concerns about the validity of inference conducted under the assumption of independence of the nuclear families. We indeed prove theoretically that, in a conditional logistic regression analysis applicable to disease cases and their genotyped parents, the naive model-based estimator of the variance of the coefficient estimates underestimates the true variance. However, simulations with realistic effect sizes of risk variants and variation of this effect from family to family reveal that the underestimation is negligible. The simulations also show the greater efficiency of the model-based variance estimator compared to a robust empirical estimator. Our recommendation is therefore, to use the model-based estimator of variance for inference on effects of genetic variants.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2015-0056","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67002948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}