Statistical Applications in Genetics and Molecular Biology最新文献_第10页

AGGrEGATOr: A Gene-based GEne-Gene interActTiOn test for case-control association studies AGGrEGATOr:一种基于基因的基因-基因相互作用试验，用于病例对照关联研究

IF 0.9 4区数学 Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2016-04-01 DOI: 10.1515/sagmb-2015-0074

M. Emily

Abstract Among the large of number of statistical methods that have been proposed to identify gene-gene interactions in case-control genome-wide association studies (GWAS), gene-based methods have recently grown in popularity as they confer advantage in both statistical power and biological interpretation. All of the gene-based methods jointly model the distribution of single nucleotide polymorphisms (SNPs) sets prior to the statistical test, leading to a limited power to detect sums of SNP-SNP signals. In this paper, we instead propose a gene-based method that first performs SNP-SNP interaction tests before aggregating the obtained p-values into a test at the gene level. Our method called AGGrEGATOr is based on a minP procedure that tests the significance of the minimum of a set of p-values. We use simulations to assess the capacity of AGGrEGATOr to correctly control for type-I error. The benefits of our approach in terms of statistical power and robustness to SNPs set characteristics are evaluated in a wide range of disease models by comparing it to previous methods. We also apply our method to detect gene pairs associated to rheumatoid arthritis (RA) on the GSE39428 dataset. We identify 13 potential gene-gene interactions and replicate one gene pair in the Wellcome Trust Case Control Consortium dataset at the level of 5%. We further test 15 gene pairs, previously reported as being statistically associated with RA or Crohn’s disease (CD) or coronary artery disease (CAD), for replication in the Wellcome Trust Case Control Consortium dataset. We show that AGGrEGATOr is the only method able to successfully replicate seven gene pairs.

在病例对照全基因组关联研究(GWAS)中，已经提出了大量用于识别基因-基因相互作用的统计方法，其中基于基因的方法最近越来越受欢迎，因为它们在统计能力和生物学解释方面都具有优势。所有基于基因的方法都是在统计检验之前联合建模单核苷酸多态性(snp)集的分布，导致检测SNP-SNP信号数量的能力有限。在本文中，我们提出了一种基于基因的方法，首先进行SNP-SNP相互作用测试，然后将获得的p值聚合到基因水平的测试中。我们称为AGGrEGATOr的方法基于一个minP过程，该过程测试一组p值的最小值的显著性。我们使用仿真来评估AGGrEGATOr正确控制i型错误的能力。通过与以前的方法进行比较，我们的方法在统计能力和对snp集特征的鲁棒性方面的优势在广泛的疾病模型中得到了评估。我们还应用我们的方法在GSE39428数据集上检测与类风湿关节炎(RA)相关的基因对。我们确定了13个潜在的基因-基因相互作用，并在Wellcome Trust病例控制联盟数据集中以5%的水平复制了一个基因对。我们进一步测试了15对基因对，这些基因对先前报道与RA或克罗恩病(CD)或冠状动脉疾病(CAD)有统计学相关性，并在威康信托病例控制联盟数据集中进行了复制。我们发现AGGrEGATOr是唯一能够成功复制7对基因的方法。

{"title":"AGGrEGATOr: A Gene-based GEne-Gene interActTiOn test for case-control association studies","authors":"M. Emily","doi":"10.1515/sagmb-2015-0074","DOIUrl":"https://doi.org/10.1515/sagmb-2015-0074","url":null,"abstract":"Abstract Among the large of number of statistical methods that have been proposed to identify gene-gene interactions in case-control genome-wide association studies (GWAS), gene-based methods have recently grown in popularity as they confer advantage in both statistical power and biological interpretation. All of the gene-based methods jointly model the distribution of single nucleotide polymorphisms (SNPs) sets prior to the statistical test, leading to a limited power to detect sums of SNP-SNP signals. In this paper, we instead propose a gene-based method that first performs SNP-SNP interaction tests before aggregating the obtained p-values into a test at the gene level. Our method called AGGrEGATOr is based on a minP procedure that tests the significance of the minimum of a set of p-values. We use simulations to assess the capacity of AGGrEGATOr to correctly control for type-I error. The benefits of our approach in terms of statistical power and robustness to SNPs set characteristics are evaluated in a wide range of disease models by comparing it to previous methods. We also apply our method to detect gene pairs associated to rheumatoid arthritis (RA) on the GSE39428 dataset. We identify 13 potential gene-gene interactions and replicate one gene pair in the Wellcome Trust Case Control Consortium dataset at the level of 5%. We further test 15 gene pairs, previously reported as being statistically associated with RA or Crohn’s disease (CD) or coronary artery disease (CAD), for replication in the Wellcome Trust Case Control Consortium dataset. We show that AGGrEGATOr is the only method able to successfully replicate seven gene pairs.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"15 1","pages":"151 - 171"},"PeriodicalIF":0.9,"publicationDate":"2016-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2015-0074","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67003013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 14

What if we ignore the random effects when analyzing RNA-seq data in a multifactor experiment 如果我们在多因素实验中分析RNA-seq数据时忽略随机效应会怎样

IF 0.9 4区数学 Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2016-04-01 DOI: 10.1515/sagmb-2015-0011

Shiqi Cui, Tieming Ji, Jilong Li, J. Cheng, Jing Qiu

Abstract Identifying differentially expressed (DE) genes between different conditions is one of the main goals of RNA-seq data analysis. Although a large amount of RNA-seq data were produced for two-group comparison with small sample sizes at early stage, more and more RNA-seq data are being produced in the setting of complex experimental designs such as split-plot designs and repeated measure designs. Data arising from such experiments are traditionally analyzed by mixed-effects models. Therefore an appropriate statistical approach for analyzing RNA-seq data from such designs should be generalized linear mixed models (GLMM) or similar approaches that allow for random effects. However, common practices for analyzing such data in literature either treat random effects as fixed or completely ignore the experimental design and focus on two-group comparison using partial data. In this paper, we examine the effect of ignoring the random effects when analyzing RNA-seq data. We accomplish this goal by comparing the standard GLMM model to the methods that ignore the random effects through simulation studies and real data analysis. Our studies show that, ignoring random effects in a multi-factor experiment can lead to the increase of the false positives among the top selected genes or lower power when the nominal FDR level is controlled.

识别不同条件下的差异表达(DE)基因是RNA-seq数据分析的主要目标之一。虽然在早期的两组小样本量比较中产生了大量的RNA-seq数据，但越来越多的RNA-seq数据是在复杂的实验设计中产生的，如分裂图设计和重复测量设计。这类实验产生的数据传统上是用混合效应模型来分析的。因此，分析来自此类设计的RNA-seq数据的适当统计方法应该是广义线性混合模型(GLMM)或允许随机效应的类似方法。然而，文献中分析这类数据的常见做法，要么将随机效应视为固定效应，要么完全忽略实验设计，只关注使用部分数据的两组比较。在本文中，我们研究忽略随机效应在分析RNA-seq数据时的影响。我们通过模拟研究和实际数据分析，将标准GLMM模型与忽略随机效应的方法进行了比较，从而实现了这一目标。我们的研究表明，在多因素实验中，忽略随机效应可能导致首选基因的假阳性增加，或者在名义FDR水平受到控制时导致功率降低。

{"title":"What if we ignore the random effects when analyzing RNA-seq data in a multifactor experiment","authors":"Shiqi Cui, Tieming Ji, Jilong Li, J. Cheng, Jing Qiu","doi":"10.1515/sagmb-2015-0011","DOIUrl":"https://doi.org/10.1515/sagmb-2015-0011","url":null,"abstract":"Abstract Identifying differentially expressed (DE) genes between different conditions is one of the main goals of RNA-seq data analysis. Although a large amount of RNA-seq data were produced for two-group comparison with small sample sizes at early stage, more and more RNA-seq data are being produced in the setting of complex experimental designs such as split-plot designs and repeated measure designs. Data arising from such experiments are traditionally analyzed by mixed-effects models. Therefore an appropriate statistical approach for analyzing RNA-seq data from such designs should be generalized linear mixed models (GLMM) or similar approaches that allow for random effects. However, common practices for analyzing such data in literature either treat random effects as fixed or completely ignore the experimental design and focus on two-group comparison using partial data. In this paper, we examine the effect of ignoring the random effects when analyzing RNA-seq data. We accomplish this goal by comparing the standard GLMM model to the methods that ignore the random effects through simulation studies and real data analysis. Our studies show that, ignoring random effects in a multi-factor experiment can lead to the increase of the false positives among the top selected genes or lower power when the nominal FDR level is controlled.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"15 1","pages":"105 - 87"},"PeriodicalIF":0.9,"publicationDate":"2016-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2015-0011","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67002901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 23

Comparing five statistical methods of differential methylation identification using bisulfite sequencing data 亚硫酸酯测序数据差异甲基化鉴定的五种统计方法比较

IF 0.9 4区数学 Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2016-04-01 DOI: 10.1515/sagmb-2015-0078

Xiaoqing Yu, Shuying Sun

Abstract We are presenting a comprehensive comparative analysis of five differential methylation (DM) identification methods: methylKit, BSmooth, BiSeq, HMM-DM, and HMM-Fisher, which are developed for bisulfite sequencing (BS) data. We summarize the features of these methods from several analytical aspects and compare their performances using both simulated and real BS datasets. Our comparison results are summarized below. First, parameter settings may largely affect the accuracy of DM identification. Different from default settings, modified parameter settings yield higher sensitivity and/or lower false positive rates. Second, all five methods show more accurate results when identifying simulated DM regions that are long and have small within-group variation, but they have low concordance, probably due to the different approaches they have used for DM identification. Third, HMM-DM and HMM-Fisher yield relatively higher sensitivity and lower false positive rates than others, especially in DM regions with large variation. Finally, we have found that among the three methods that involve methylation estimation (methylKit, BSmooth, and BiSeq), BiSeq can best present raw methylation signals. Therefore, based on these results, we suggest that users select DM identification methods based on the characteristics of their data and the advantages of each method.

摘要:我们对亚硫酸酯测序(BS)数据开发的五种差异甲基化(DM)鉴定方法:methylKit、BSmooth、BiSeq、HMM-DM和HMM-Fisher进行了全面的比较分析。我们从几个分析方面总结了这些方法的特点，并使用模拟和真实BS数据集比较了它们的性能。我们的比较结果总结如下。首先，参数设置可能在很大程度上影响DM识别的准确性。与默认设置不同，修改后的参数设置产生更高的灵敏度和/或更低的误报率。其次，在识别长且组内变异小的模拟糖尿病区域时，这五种方法都显示出更准确的结果，但它们的一致性较低，可能是由于它们用于糖尿病识别的方法不同。第三，HMM-DM和HMM-Fisher的敏感性相对较高，假阳性率相对较低，特别是在变异较大的DM区域。最后，我们发现在涉及甲基化估计的三种方法(methylKit, BSmooth和BiSeq)中，BiSeq可以最好地呈现原始甲基化信号。因此，基于这些结果，我们建议用户根据其数据的特征和每种方法的优势来选择DM识别方法。

{"title":"Comparing five statistical methods of differential methylation identification using bisulfite sequencing data","authors":"Xiaoqing Yu, Shuying Sun","doi":"10.1515/sagmb-2015-0078","DOIUrl":"https://doi.org/10.1515/sagmb-2015-0078","url":null,"abstract":"Abstract We are presenting a comprehensive comparative analysis of five differential methylation (DM) identification methods: methylKit, BSmooth, BiSeq, HMM-DM, and HMM-Fisher, which are developed for bisulfite sequencing (BS) data. We summarize the features of these methods from several analytical aspects and compare their performances using both simulated and real BS datasets. Our comparison results are summarized below. First, parameter settings may largely affect the accuracy of DM identification. Different from default settings, modified parameter settings yield higher sensitivity and/or lower false positive rates. Second, all five methods show more accurate results when identifying simulated DM regions that are long and have small within-group variation, but they have low concordance, probably due to the different approaches they have used for DM identification. Third, HMM-DM and HMM-Fisher yield relatively higher sensitivity and lower false positive rates than others, especially in DM regions with large variation. Finally, we have found that among the three methods that involve methylation estimation (methylKit, BSmooth, and BiSeq), BiSeq can best present raw methylation signals. Therefore, based on these results, we suggest that users select DM identification methods based on the characteristics of their data and the advantages of each method.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"15 1","pages":"173 - 191"},"PeriodicalIF":0.9,"publicationDate":"2016-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2015-0078","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67003069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 20

Belief propagation in genotype-phenotype networks 基因型-表现型网络中的信念传播

IF 0.9 4区数学 Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2016-03-01 DOI: 10.1515/sagmb-2015-0058

Janhavi Moharil, Paul May, D. Gaile, R. Blair

Abstract Graphical models have proven to be a valuable tool for connecting genotypes and phenotypes. Structural learning of phenotype-genotype networks has received considerable attention in the post-genome era. In recent years, a dozen different methods have emerged for network inference, which leverage natural variation that arises in certain genetic populations. The structure of the network itself can be used to form hypotheses based on the inferred direct and indirect network relationships, but represents a premature endpoint to the graphical analyses. In this work, we extend this endpoint. We examine the unexplored problem of perturbing a given network structure, and quantifying the system-wide effects on the network in a node-wise manner. The perturbation is achieved through the setting of values of phenotype node(s), which may reflect an inhibition or activation, and propagating this information through the entire network. We leverage belief propagation methods in Conditional Gaussian Bayesian Networks (CG-BNs), in order to absorb and propagate phenotypic evidence through the network. We show that the modeling assumptions adopted for genotype-phenotype networks represent an important sub-class of CG-BNs, which possess properties that ensure exact inference in the propagation scheme. The system-wide effects of the perturbation are quantified in a node-wise manner through the comparison of perturbed and unperturbed marginal distributions using a symmetric Kullback-Leibler divergence. Applications to kidney and skin cancer expression quantitative trait loci (eQTL) data from different mus musculus populations are presented. System-wide effects in the network were predicted and visualized across a spectrum of evidence. Sub-pathways and regions of the network responded in concert, suggesting co-regulation and coordination throughout the network in response to phenotypic changes. We demonstrate how these predicted system-wide effects can be examined in connection with estimated class probabilities for covariates of interest, e.g. cancer status. Despite the uncertainty in the network structure, we demonstrate the system-wide predictions are stable across an ensemble of highly likely networks. A software package, geneNetBP, which implements our approach, was developed in the R programming language.

图形模型已被证明是连接基因型和表型的有价值的工具。表型-基因型网络的结构学习在后基因组时代受到了相当大的关注。近年来，出现了十几种不同的网络推断方法，这些方法利用了某些遗传群体中出现的自然变异。网络本身的结构可以用来根据推断的直接和间接网络关系形成假设，但对于图形分析来说，这是一个过早的终点。在这项工作中，我们扩展了这个端点。我们研究了干扰给定网络结构的未探索问题，并以节点明智的方式量化了对网络的系统范围影响。通过设置表型节点值来实现扰动，表型节点值可能反映抑制或激活，并通过整个网络传播该信息。我们利用条件高斯贝叶斯网络(cg - bn)中的信念传播方法，以便通过网络吸收和传播表型证据。我们表明，基因型-表型网络采用的建模假设代表了cg - bn的一个重要子类，它具有确保在传播方案中精确推断的特性。通过使用对称kullbackleibler散度比较扰动和未扰动的边际分布，以节点方式量化了扰动的系统范围效应。本文介绍了不同小家鼠群体中肾癌和皮肤癌表达数量性状位点(eQTL)数据的应用。通过一系列证据预测和可视化网络中的全系统效应。网络的子通路和区域一致响应，表明整个网络在响应表型变化时共同调节和协调。我们展示了如何将这些预测的全系统效应与相关协变量(如癌症状态)的估计类概率联系起来进行检验。尽管网络结构存在不确定性，但我们证明了系统范围的预测在高度可能的网络集合中是稳定的。用R编程语言开发了一个软件包geneNetBP，它实现了我们的方法。

{"title":"Belief propagation in genotype-phenotype networks","authors":"Janhavi Moharil, Paul May, D. Gaile, R. Blair","doi":"10.1515/sagmb-2015-0058","DOIUrl":"https://doi.org/10.1515/sagmb-2015-0058","url":null,"abstract":"Abstract Graphical models have proven to be a valuable tool for connecting genotypes and phenotypes. Structural learning of phenotype-genotype networks has received considerable attention in the post-genome era. In recent years, a dozen different methods have emerged for network inference, which leverage natural variation that arises in certain genetic populations. The structure of the network itself can be used to form hypotheses based on the inferred direct and indirect network relationships, but represents a premature endpoint to the graphical analyses. In this work, we extend this endpoint. We examine the unexplored problem of perturbing a given network structure, and quantifying the system-wide effects on the network in a node-wise manner. The perturbation is achieved through the setting of values of phenotype node(s), which may reflect an inhibition or activation, and propagating this information through the entire network. We leverage belief propagation methods in Conditional Gaussian Bayesian Networks (CG-BNs), in order to absorb and propagate phenotypic evidence through the network. We show that the modeling assumptions adopted for genotype-phenotype networks represent an important sub-class of CG-BNs, which possess properties that ensure exact inference in the propagation scheme. The system-wide effects of the perturbation are quantified in a node-wise manner through the comparison of perturbed and unperturbed marginal distributions using a symmetric Kullback-Leibler divergence. Applications to kidney and skin cancer expression quantitative trait loci (eQTL) data from different mus musculus populations are presented. System-wide effects in the network were predicted and visualized across a spectrum of evidence. Sub-pathways and regions of the network responded in concert, suggesting co-regulation and coordination throughout the network in response to phenotypic changes. We demonstrate how these predicted system-wide effects can be examined in connection with estimated class probabilities for covariates of interest, e.g. cancer status. Despite the uncertainty in the network structure, we demonstrate the system-wide predictions are stable across an ensemble of highly likely networks. A software package, geneNetBP, which implements our approach, was developed in the R programming language.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"15 1","pages":"39 - 53"},"PeriodicalIF":0.9,"publicationDate":"2016-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2015-0058","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67002956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 7

Identification of consistent functional genetic modules 鉴定一致的功能遗传模块

IF 0.9 4区数学 Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2016-03-01 DOI: 10.1515/sagmb-2015-0026

J. Miecznikowski, D. Gaile, Xiwei Chen, D. Tritchler

Abstract It is often of scientific interest to find a set of genes that may represent an independent functional module or network, such as a functional gene expression module causing a biological response, a transcription regulatory network, or a constellation of mutations jointly causing a disease. In this paper we are specifically interested in identifying modules that control a particular outcome variable such as a disease biomarker. We discuss the statistical properties that functional networks should possess and introduce the concept of network consistency which should be satisfied by real functional networks of cooperating genes, and directly use the concept in the pathway discovery method we present. Our method gives superior performance for all but the simplest functional networks.

寻找可能代表一个独立功能模块或网络的一组基因，如引起生物反应的功能基因表达模块、转录调控网络或共同引起疾病的一系列突变，往往具有科学意义。在本文中，我们特别感兴趣的是识别控制特定结果变量(如疾病生物标志物)的模块。我们讨论了功能网络应具备的统计性质，并引入了网络一致性的概念，而网络一致性是由真实的协同基因功能网络所满足的，并将这一概念直接应用于我们所提出的路径发现方法中。除了最简单的函数网络外，我们的方法对所有网络都具有优越的性能。

引用次数: 6

HMM-DM: identifying differentially methylated regions using a hidden Markov model HMM-DM:使用隐马尔可夫模型识别差异甲基化区域

IF 0.9 4区数学 Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2016-03-01 DOI: 10.1515/sagmb-2015-0077

Xiaoqing Yu, Shuying Sun

Abstract DNA methylation is an epigenetic modification involved in organism development and cellular differentiation. Identifying differential methylations can help to study genomic regions associated with diseases. Differential methylation studies on single-CG resolution have become possible with the bisulfite sequencing (BS) technology. However, there is still a lack of efficient statistical methods for identifying differentially methylated (DM) regions in BS data. We have developed a new approach named HMM-DM to detect DM regions between two biological conditions using BS data. This new approach first uses a hidden Markov model (HMM) to identify DM CG sites accounting for spatial correlation across CG sites and variation across samples, and then summarizes identified sites into regions. We demonstrate through a simulation study that our approach has a superior performance compared to BSmooth. We also illustrate the application of HMM-DM using a real breast cancer dataset.

DNA甲基化是一种参与生物发育和细胞分化的表观遗传修饰。鉴定差异甲基化有助于研究与疾病相关的基因组区域。亚硫酸酯测序(BS)技术使单cg分辨率的差异甲基化研究成为可能。然而，仍然缺乏有效的统计方法来识别BS数据中的差异甲基化(DM)区域。我们开发了一种名为HMM-DM的新方法，利用BS数据检测两种生物状态之间的DM区域。该方法首先利用隐马尔可夫模型(HMM)识别DM CG位点，考虑CG位点之间的空间相关性和样本间的差异，然后将识别出的位点归纳为区域。我们通过仿真研究证明，与BSmooth相比，我们的方法具有优越的性能。我们还使用真实的乳腺癌数据集说明了HMM-DM的应用。

引用次数: 30

HMM-Fisher: identifying differential methylation using a hidden Markov model and Fisher’s exact test HMM-Fisher:使用隐马尔可夫模型和Fisher的精确检验来识别差异甲基化

IF 0.9 4区数学 Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2016-03-01 DOI: 10.1515/sagmb-2015-0076

Shuying Sun, Xiaoqing Yu

Abstract DNA methylation is an epigenetic event that plays an important role in regulating gene expression. It is important to study DNA methylation, especially differential methylation patterns between two groups of samples (e.g. patients vs. normal individuals). With next generation sequencing technologies, it is now possible to identify differential methylation patterns by considering methylation at the single CG site level in an entire genome. However, it is challenging to analyze large and complex NGS data. In order to address this difficult question, we have developed a new statistical method using a hidden Markov model and Fisher’s exact test (HMM-Fisher) to identify differentially methylated cytosines and regions. We first use a hidden Markov chain to model the methylation signals to infer the methylation state as Not methylated (N), Partly methylated (P), and Fully methylated (F) for each individual sample. We then use Fisher’s exact test to identify differentially methylated CG sites. We show the HMM-Fisher method and compare it with commonly cited methods using both simulated data and real sequencing data. The results show that HMM-Fisher outperforms the current available methods to which we have compared. HMM-Fisher is efficient and robust in identifying heterogeneous DM regions.

DNA甲基化是一种表观遗传事件，在基因表达调控中起着重要作用。研究DNA甲基化是很重要的，特别是两组样本(例如患者与正常人)之间的甲基化模式差异。随着下一代测序技术的发展，现在有可能通过考虑整个基因组中单个CG位点水平的甲基化来识别不同的甲基化模式。然而，分析大型和复杂的NGS数据是一项挑战。为了解决这个难题，我们开发了一种新的统计方法，使用隐马尔可夫模型和Fisher精确检验(HMM-Fisher)来识别差异甲基化的胞嘧啶和区域。我们首先使用隐马尔可夫链对甲基化信号进行建模，以推断每个样本的甲基化状态为未甲基化(N)、部分甲基化(P)和完全甲基化(F)。然后，我们使用Fisher的精确测试来识别差异甲基化的CG位点。我们展示了HMM-Fisher方法，并使用模拟数据和实际测序数据将其与常用的方法进行了比较。结果表明，HMM-Fisher优于我们所比较的当前可用方法。HMM-Fisher在识别异质DM区域方面是有效且稳健的。

{"title":"HMM-Fisher: identifying differential methylation using a hidden Markov model and Fisher’s exact test","authors":"Shuying Sun, Xiaoqing Yu","doi":"10.1515/sagmb-2015-0076","DOIUrl":"https://doi.org/10.1515/sagmb-2015-0076","url":null,"abstract":"Abstract DNA methylation is an epigenetic event that plays an important role in regulating gene expression. It is important to study DNA methylation, especially differential methylation patterns between two groups of samples (e.g. patients vs. normal individuals). With next generation sequencing technologies, it is now possible to identify differential methylation patterns by considering methylation at the single CG site level in an entire genome. However, it is challenging to analyze large and complex NGS data. In order to address this difficult question, we have developed a new statistical method using a hidden Markov model and Fisher’s exact test (HMM-Fisher) to identify differentially methylated cytosines and regions. We first use a hidden Markov chain to model the methylation signals to infer the methylation state as Not methylated (N), Partly methylated (P), and Fully methylated (F) for each individual sample. We then use Fisher’s exact test to identify differentially methylated CG sites. We show the HMM-Fisher method and compare it with commonly cited methods using both simulated data and real sequencing data. The results show that HMM-Fisher outperforms the current available methods to which we have compared. HMM-Fisher is efficient and robust in identifying heterogeneous DM regions.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"15 1","pages":"55 - 67"},"PeriodicalIF":0.9,"publicationDate":"2016-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2015-0076","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67003025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 24

MDI-GPU: accelerating integrative modelling for genomic-scale data using GP-GPU computing MDI-GPU:使用GP-GPU计算加速基因组尺度数据的集成建模

IF 0.9 4区数学 Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2016-02-24 DOI: 10.1515/sagmb-2015-0055

Samuel A. Mason, Faiz Sayyid, Paul D. W. Kirk, Colin Starr, D. Wild

Abstract The integration of multi-dimensional datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct – but often complementary – information. However, the large amount of data adds burden to any inference task. Flexible Bayesian methods may reduce the necessity for strong modelling assumptions, but can also increase the computational burden. We present an improved implementation of a Bayesian correlated clustering algorithm, that permits integrated clustering to be routinely performed across multiple datasets, each with tens of thousands of items. By exploiting GPU based computation, we are able to improve runtime performance of the algorithm by almost four orders of magnitude. This permits analysis across genomic-scale data sets, greatly expanding the range of applications over those originally possible. MDI is available here: http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/.

多维数据集的集成一直是系统生物学和基因组医学的一个关键挑战。现代高通量技术产生了大量不同的数据类型，提供了不同的(但往往是互补的)信息。然而，大量的数据给任何推理任务都增加了负担。灵活的贝叶斯方法可以减少对强建模假设的需要，但也会增加计算负担。我们提出了一种改进的贝叶斯相关聚类算法的实现，它允许在多个数据集上常规地执行集成聚类，每个数据集都有数万个项目。通过利用基于GPU的计算，我们能够将算法的运行时性能提高近四个数量级。这允许跨基因组规模的数据集进行分析，大大扩展了最初可能的应用范围。MDI可以在这里获得:http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/。

引用次数: 8

Homology cluster differential expression analysis for interspecies mRNA-Seq experiments 种间mRNA-Seq实验的同源聚类差异表达分析

IF 0.9 4区数学 Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2015-12-01 DOI: 10.1515/sagmb-2014-0056

J. Gelfond, J. Ibrahim, Ming-Hui Chen, Wei Sun, Kaitlyn N. Lewis, Sean Kinahan, Matthew A. Hibbs, R. Buffenstein

Abstract There is an increasing demand for exploration of the transcriptomes of multiple species with extraordinary traits such as the naked-mole rat (NMR). The NMR is remarkable because of its longevity and resistance to developing cancer. It is of scientific interest to understand the molecular mechanisms that impart these traits, and RNA-sequencing experiments with comparator species can correlate transcriptome dynamics with these phenotypes. Comparing transcriptome differences requires a homology mapping of each transcript in one species to transcript(s) within the other. Such mappings are necessary, especially if one species does not have well-annotated genome available. Current approaches for this type of analysis typically identify the best match for each transcript, but the best match analysis ignores the inherent risks of mismatch when there are multiple candidate transcripts with similar homology scores. We present a method that treats the set of homologs from a novel species as a cluster corresponding to a single gene in the reference species, and we compare the cluster-based approach to a conventional best-match analysis in both simulated data and a case study with NMR and mouse tissues. We demonstrate that the cluster-based approach has superior power to detect differential expression.

对具有特殊性状的多种物种转录组的研究需求日益增加，如裸鼹鼠(NMR)。核磁共振的非凡之处在于它的寿命和抗癌性。了解赋予这些性状的分子机制具有科学意义，比较物种的rna测序实验可以将转录组动力学与这些表型联系起来。比较转录组差异需要一个物种中每个转录本与另一个物种中转录本的同源性映射。这样的映射是必要的，特别是如果一个物种没有很好的注释基因组可用。目前这种类型的分析方法通常为每个转录本确定最佳匹配，但是最佳匹配分析忽略了当存在多个具有相似同源性分数的候选转录本时不匹配的固有风险。我们提出了一种方法，将来自新物种的同系物集作为参考物种中单个基因对应的集群，并将基于集群的方法与传统的最佳匹配分析方法进行了比较，包括模拟数据和核磁共振和小鼠组织的案例研究。我们证明了基于聚类的方法在检测差异表达方面具有优越的能力。

{"title":"Homology cluster differential expression analysis for interspecies mRNA-Seq experiments","authors":"J. Gelfond, J. Ibrahim, Ming-Hui Chen, Wei Sun, Kaitlyn N. Lewis, Sean Kinahan, Matthew A. Hibbs, R. Buffenstein","doi":"10.1515/sagmb-2014-0056","DOIUrl":"https://doi.org/10.1515/sagmb-2014-0056","url":null,"abstract":"Abstract There is an increasing demand for exploration of the transcriptomes of multiple species with extraordinary traits such as the naked-mole rat (NMR). The NMR is remarkable because of its longevity and resistance to developing cancer. It is of scientific interest to understand the molecular mechanisms that impart these traits, and RNA-sequencing experiments with comparator species can correlate transcriptome dynamics with these phenotypes. Comparing transcriptome differences requires a homology mapping of each transcript in one species to transcript(s) within the other. Such mappings are necessary, especially if one species does not have well-annotated genome available. Current approaches for this type of analysis typically identify the best match for each transcript, but the best match analysis ignores the inherent risks of mismatch when there are multiple candidate transcripts with similar homology scores. We present a method that treats the set of homologs from a novel species as a cluster corresponding to a single gene in the reference species, and we compare the cluster-based approach to a conventional best-match analysis in both simulated data and a case study with NMR and mouse tissues. We demonstrate that the cluster-based approach has superior power to detect differential expression.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"14 1","pages":"507 - 516"},"PeriodicalIF":0.9,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2014-0056","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67002383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

On the validity of within-nuclear-family genetic association analysis in samples of extended families 核心家族遗传关联分析在大家庭样本中的有效性探讨

IF 0.9 4区数学 Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2015-12-01 DOI: 10.1515/sagmb-2015-0056

A. Bureau, T. Duchesne

Abstract Splitting extended families into their component nuclear families to apply a genetic association method designed for nuclear families is a widespread practice in familial genetic studies. Dependence among genotypes and phenotypes of nuclear families from the same extended family arises because of genetic linkage of the tested marker with a risk variant or because of familial specificity of genetic effects due to gene-environment interaction. This raises concerns about the validity of inference conducted under the assumption of independence of the nuclear families. We indeed prove theoretically that, in a conditional logistic regression analysis applicable to disease cases and their genotyped parents, the naive model-based estimator of the variance of the coefficient estimates underestimates the true variance. However, simulations with realistic effect sizes of risk variants and variation of this effect from family to family reveal that the underestimation is negligible. The simulations also show the greater efficiency of the model-based variance estimator compared to a robust empirical estimator. Our recommendation is therefore, to use the model-based estimator of variance for inference on effects of genetic variants.

摘要将大家庭划分为核心家庭，应用为核心家庭设计的遗传关联方法是家族遗传学研究中普遍采用的一种方法。来自同一大家庭的核心家庭的基因型和表型之间的依赖性是由于被测标记与风险变异的遗传连锁或由于基因-环境相互作用导致的遗传效应的家族特异性而产生的。这引起了人们对在假定核心家庭独立的情况下所作推断的有效性的关注。我们确实从理论上证明，在适用于疾病病例及其基因型父母的条件逻辑回归分析中，基于朴素模型的系数方差估计器低估了真实方差。然而，模拟风险变量的实际效应大小以及这种效应在家庭之间的变化表明，低估是可以忽略不计的。仿真还表明，与稳健的经验估计器相比，基于模型的方差估计器的效率更高。因此，我们建议使用基于模型的方差估计器来推断遗传变异的影响。

{"title":"On the validity of within-nuclear-family genetic association analysis in samples of extended families","authors":"A. Bureau, T. Duchesne","doi":"10.1515/sagmb-2015-0056","DOIUrl":"https://doi.org/10.1515/sagmb-2015-0056","url":null,"abstract":"Abstract Splitting extended families into their component nuclear families to apply a genetic association method designed for nuclear families is a widespread practice in familial genetic studies. Dependence among genotypes and phenotypes of nuclear families from the same extended family arises because of genetic linkage of the tested marker with a risk variant or because of familial specificity of genetic effects due to gene-environment interaction. This raises concerns about the validity of inference conducted under the assumption of independence of the nuclear families. We indeed prove theoretically that, in a conditional logistic regression analysis applicable to disease cases and their genotyped parents, the naive model-based estimator of the variance of the coefficient estimates underestimates the true variance. However, simulations with realistic effect sizes of risk variants and variation of this effect from family to family reveal that the underestimation is negligible. The simulations also show the greater efficiency of the model-based variance estimator compared to a robust empirical estimator. Our recommendation is therefore, to use the model-based estimator of variance for inference on effects of genetic variants.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"14 1","pages":"533 - 549"},"PeriodicalIF":0.9,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2015-0056","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67002948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3