Statistical Applications in Genetics and Molecular Biology: Latest Articles

Stability selection for lasso, ridge and elastic net implemented with AFT models
IF 0.9; Zone 4 (Mathematics); Q3 Mathematics; Pub Date: 2016-04-25; DOI: 10.1515/sagmb-2017-0001
M. H. R. Khan, Anamika Bhadra, Tamanna Howlader
Abstract The instability in the selection of models is a major concern with data sets containing a large number of covariates. We focus on stability selection, a technique that improves variable selection performance for a range of selection methods by aggregating the results of applying a selection procedure to sub-samples of the data, in a setting where the observations are subject to right censoring. Accelerated failure time (AFT) models have proved useful in many contexts, including heavy censoring (for example, in cancer survival) and high dimensionality (for example, in microarray data). We implement the stability selection approach using three variable selection techniques (lasso, ridge regression, and elastic net) applied to censored data through AFT models. We compare the performance of these regularized techniques with and without stability selection in simulation studies and on two real data examples: a breast cancer data set and a diffuse large B-cell lymphoma data set. The results suggest that stability selection consistently gives a stable picture of which variables are selected and that, as the dimension of the data increases, the performance of methods with stability selection improves relative to methods without it, irrespective of the collinearity between the covariates.
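The subsample-and-aggregate idea behind stability selection can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it ignores censoring and the AFT model entirely, and it swaps in a hypothetical base selector, `top_k_by_correlation` (keep the k covariates most correlated with the response), in place of lasso, ridge, or elastic net. The function names, the half-sample size, the 100 subsamples, and the 0.6 frequency threshold are all assumptions made for the example.

```python
import random


def top_k_by_correlation(X, y, k):
    """Hypothetical base selector: indices of the k covariates with the
    largest absolute Pearson correlation with y (a stand-in for lasso etc.)."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    y_dev = [v - y_mean for v in y]
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        c_mean = sum(col) / n
        c_dev = [v - c_mean for v in col]
        num = sum(a * b for a, b in zip(c_dev, y_dev))
        den = (sum(a * a for a in c_dev) * sum(b * b for b in y_dev)) ** 0.5
        scores.append(abs(num / den) if den > 0 else 0.0)
    return set(sorted(range(p), key=lambda j: scores[j], reverse=True)[:k])


def stability_selection(X, y, selector, n_subsamples=100, threshold=0.6, seed=0):
    """Run `selector` on random half-samples and keep the covariates whose
    selection frequency reaches `threshold`."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)  # subsample without replacement
        chosen = selector([X[i] for i in idx], [y[i] for i in idx])
        for j in chosen:
            counts[j] += 1
    freqs = [c / n_subsamples for c in counts]
    stable = [j for j, f in enumerate(freqs) if f >= threshold]
    return stable, freqs


# Toy data (uncensored): only covariates 0 and 1 drive the response.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(200)]
y = [3 * row[0] - 3 * row[1] + rng.gauss(0, 0.5) for row in X]

stable, freqs = stability_selection(
    X, y, lambda A, b: top_k_by_correlation(A, b, 3)
)
```

The selection frequencies in `freqs` are the quantity of interest: a covariate chosen on most subsamples is reported as stable, while one that only appears occasionally is dropped, which is what makes the final set less sensitive to any single sample.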
Citations: 19
A Markov random field-based approach for joint estimation of differentially expressed genes in mouse transcriptome data.
IF 0.9; Zone 4 (Mathematics); Q3 Mathematics; Pub Date: 2016-04-01; DOI: 10.1515/sagmb-2015-0070
Zhixiang Lin, Mingfeng Li, Nenad Sestan, Hongyu Zhao

The statistical methodology developed in this study was motivated by our interest in studying neurodevelopment using the mouse brain RNA-Seq data set, where gene expression levels were measured in multiple layers of the somatosensory cortex across time in both female and male samples. We aim to identify differentially expressed genes between adjacent time points, which may provide insights into the dynamics of brain development. Because of the extremely small sample size (one male and one female at each time point), simple marginal analysis may be underpowered. We propose a Markov random field (MRF)-based approach that capitalizes on the similarity between layers, the temporal dependency, and the similarity between sexes. The model parameters are estimated by an efficient EM algorithm with a mean field-like approximation. Simulation results and real data analysis suggest that the proposed model improves the power to detect differentially expressed genes compared with simple marginal analysis. Our method also reveals biologically interesting results in the mouse brain RNA-Seq data set.

Citations: 0
AGGrEGATOr: A Gene-based GEne-Gene interActTiOn test for case-control association studies.
IF 0.9; Zone 4 (Mathematics); Q3 Mathematics; Pub Date: 2016-04-01; DOI: 10.1515/sagmb-2015-0074
Mathieu Emily

Among the large number of statistical methods that have been proposed to identify gene-gene interactions in case-control genome-wide association studies (GWAS), gene-based methods have recently grown in popularity as they confer advantages in both statistical power and biological interpretation. All of the gene-based methods jointly model the distribution of single nucleotide polymorphism (SNP) sets prior to the statistical test, leading to limited power to detect sums of SNP-SNP signals. In this paper, we instead propose a gene-based method that first performs SNP-SNP interaction tests before aggregating the obtained p-values into a test at the gene level. Our method, called AGGrEGATOr, is based on a minP procedure that tests the significance of the minimum of a set of p-values. We use simulations to assess the capacity of AGGrEGATOr to correctly control the type-I error. The benefits of our approach in terms of statistical power and robustness to SNP set characteristics are evaluated in a wide range of disease models by comparing it to previous methods. We also apply our method to detect gene pairs associated with rheumatoid arthritis (RA) in the GSE39428 dataset. We identify 13 potential gene-gene interactions and replicate one gene pair in the Wellcome Trust Case Control Consortium dataset at the 5% level. We further test 15 gene pairs, previously reported as being statistically associated with RA, Crohn's disease (CD), or coronary artery disease (CAD), for replication in the Wellcome Trust Case Control Consortium dataset. We show that AGGrEGATOr is the only method able to successfully replicate seven gene pairs.
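The minP step, testing whether the smallest of a set of p-values is itself significant, can be illustrated with a short sketch. AGGrEGATOr has to deal with SNP-SNP tests that are correlated; the code below does not. It assumes, purely for illustration, that the m p-values are independent and uniform under the null hypothesis, in which case the minimum has the closed-form null distribution P(min ≤ x) = 1 − (1 − x)^m. The function name `minp_independent` and the input values are hypothetical.

```python
def minp_independent(p_values):
    """Gene-level p-value for the minimum of m SNP-SNP p-values,
    ASSUMING the m tests are independent (illustrative simplification)."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not 0.0 <= p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    m = len(p_values)
    p_min = min(p_values)
    # Under independence, P(min p <= p_min) = 1 - (1 - p_min)^m.
    return 1.0 - (1.0 - p_min) ** m


# Four hypothetical SNP-SNP interaction p-values for one gene pair.
snp_pair_p = [0.20, 0.03, 0.50, 0.75]
gene_pair_p = minp_independent(snp_pair_p)
```

Note that the gene-level p-value is larger than the smallest SNP-SNP p-value: taking a minimum over several tests has to be paid for, and the size of that penalty depends on how many tests there are and, in the correlated case the sketch leaves out, on how strongly they depend on each other.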

Citations: 0
What if we ignore the random effects when analyzing RNA-seq data in a multifactor experiment
IF 0.9; Zone 4 (Mathematics); Q3 Mathematics; Pub Date: 2016-04-01; DOI: 10.1515/sagmb-2015-0011
Shiqi Cui, Tieming Ji, Jilong Li, J. Cheng, Jing Qiu
Abstract Identifying differentially expressed (DE) genes between different conditions is one of the main goals of RNA-seq data analysis. Although a large amount of RNA-seq data was produced for two-group comparisons with small sample sizes at an early stage, more and more RNA-seq data are being produced under complex experimental designs such as split-plot designs and repeated measures designs. Data arising from such experiments are traditionally analyzed by mixed-effects models. Therefore, an appropriate statistical approach for analyzing RNA-seq data from such designs should be generalized linear mixed models (GLMMs) or similar approaches that allow for random effects. However, common practices for analyzing such data in the literature either treat random effects as fixed or completely ignore the experimental design and focus on two-group comparisons using partial data. In this paper, we examine the effect of ignoring the random effects when analyzing RNA-seq data. We accomplish this goal by comparing the standard GLMM to methods that ignore the random effects, through simulation studies and real data analysis. Our studies show that ignoring random effects in a multi-factor experiment can lead to an increase in false positives among the top selected genes, or to lower power when the nominal FDR level is controlled.
Citations: 23
A graph theoretical approach to data fusion.
IF 0.9; Zone 4 (Mathematics); Q3 Mathematics; Pub Date: 2016-04-01; DOI: 10.1515/sagmb-2016-0016
Justina Žurauskienė, Paul D W Kirk, Michael P H Stumpf

The rapid development of high throughput experimental techniques has resulted in a growing diversity of genomic datasets being produced and requiring analysis. Therefore, it is increasingly being recognized that we can gain deeper understanding about underlying biology by combining the insights obtained from multiple, diverse datasets. Thus we propose a novel scalable computational approach to unsupervised data fusion. Our technique exploits network representations of the data to identify similarities among the datasets. We may work within the Bayesian formalism, using Bayesian nonparametric approaches to model each dataset; or (for fast, approximate, and massive scale data fusion) can naturally switch to more heuristic modeling techniques. An advantage of the proposed approach is that each dataset can initially be modeled independently (in parallel), before applying a fast post-processing step to perform data integration. This allows us to incorporate new experimental data in an online fashion, without having to rerun all of the analysis. We first demonstrate the applicability of our tool on artificial data, and then on examples from the literature, which include yeast cell cycle, breast cancer and sporadic inclusion body myositis datasets.

Citations: 0
Comparing five statistical methods of differential methylation identification using bisulfite sequencing data
IF 0.9; Zone 4 (Mathematics); Q3 Mathematics; Pub Date: 2016-04-01; DOI: 10.1515/sagmb-2015-0078
Xiaoqing Yu, Shuying Sun
Abstract We present a comprehensive comparative analysis of five differential methylation (DM) identification methods developed for bisulfite sequencing (BS) data: methylKit, BSmooth, BiSeq, HMM-DM, and HMM-Fisher. We summarize the features of these methods from several analytical aspects and compare their performance using both simulated and real BS datasets. Our comparison results are summarized below. First, parameter settings may largely affect the accuracy of DM identification: compared with the default settings, modified parameter settings yield higher sensitivity and/or lower false positive rates. Second, all five methods give more accurate results when identifying simulated DM regions that are long and have small within-group variation, but they show low concordance with one another, probably because of the different approaches they use for DM identification. Third, HMM-DM and HMM-Fisher yield relatively higher sensitivity and lower false positive rates than the other methods, especially in DM regions with large variation. Finally, among the three methods that involve methylation estimation (methylKit, BSmooth, and BiSeq), BiSeq best presents the raw methylation signals. Based on these results, we suggest that users select DM identification methods according to the characteristics of their data and the advantages of each method.
Citations: 20
Belief propagation in genotype-phenotype networks
IF 0.9; Zone 4 (Mathematics); Q3 Mathematics; Pub Date: 2016-03-01; DOI: 10.1515/sagmb-2015-0058
Janhavi Moharil, Paul May, D. Gaile, R. Blair
Abstract Graphical models have proven to be a valuable tool for connecting genotypes and phenotypes. Structural learning of phenotype-genotype networks has received considerable attention in the post-genome era. In recent years, a dozen different methods have emerged for network inference, which leverage natural variation that arises in certain genetic populations. The structure of the network itself can be used to form hypotheses based on the inferred direct and indirect network relationships, but represents a premature endpoint to the graphical analyses. In this work, we extend this endpoint. We examine the unexplored problem of perturbing a given network structure and quantifying the system-wide effects on the network in a node-wise manner. The perturbation is achieved by setting the values of one or more phenotype nodes, which may reflect an inhibition or activation, and propagating this information through the entire network. We leverage belief propagation methods in Conditional Gaussian Bayesian Networks (CG-BNs) in order to absorb and propagate phenotypic evidence through the network. We show that the modeling assumptions adopted for genotype-phenotype networks represent an important sub-class of CG-BNs, which possess properties that ensure exact inference in the propagation scheme. The system-wide effects of the perturbation are quantified in a node-wise manner through the comparison of perturbed and unperturbed marginal distributions using a symmetric Kullback-Leibler divergence. Applications to kidney and skin cancer expression quantitative trait loci (eQTL) data from different Mus musculus populations are presented. System-wide effects in the network were predicted and visualized across a spectrum of evidence. Sub-pathways and regions of the network responded in concert, suggesting co-regulation and coordination throughout the network in response to phenotypic changes. We demonstrate how these predicted system-wide effects can be examined in connection with estimated class probabilities for covariates of interest, e.g. cancer status. Despite the uncertainty in the network structure, we demonstrate that the system-wide predictions are stable across an ensemble of highly likely networks. A software package, geneNetBP, which implements our approach, was developed in the R programming language.
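The node-wise comparison of perturbed and unperturbed marginals can be made concrete for the simplest case, a single continuous node whose marginal is univariate Gaussian. The sketch below is in Python rather than R and is not taken from geneNetBP. It assumes the "sum of both directions" convention for the symmetric Kullback-Leibler divergence (the abstract does not say whether a sum or an average is used), and the function names and the example means and standard deviations are hypothetical.

```python
import math


def kl_gaussian(mu_p, sd_p, mu_q, sd_q):
    """KL(P || Q) for univariate normals P = N(mu_p, sd_p^2), Q = N(mu_q, sd_q^2)."""
    if sd_p <= 0 or sd_q <= 0:
        raise ValueError("standard deviations must be positive")
    return (math.log(sd_q / sd_p)
            + (sd_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sd_q ** 2)
            - 0.5)


def symmetric_kl(mu_p, sd_p, mu_q, sd_q):
    """Symmetrised divergence KL(P || Q) + KL(Q || P) (sum convention, assumed)."""
    return kl_gaussian(mu_p, sd_p, mu_q, sd_q) + kl_gaussian(mu_q, sd_q, mu_p, sd_p)


# Hypothetical marginal of one node before and after absorbing evidence,
# given as (mean, standard deviation).
unperturbed = (0.0, 1.0)
perturbed = (1.0, 1.0)
score = symmetric_kl(*unperturbed, *perturbed)
```

When the two marginals share the same variance, the score reduces to the squared mean shift divided by that variance, so a node whose mean barely moves after the perturbation scores close to zero and a strongly affected node scores high.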
Citations: 7
Identification of consistent functional genetic modules
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2016-03-01 DOI: 10.1515/sagmb-2015-0026
J. Miecznikowski, D. Gaile, Xiwei Chen, D. Tritchler
Abstract It is often of scientific interest to find a set of genes that may represent an independent functional module or network, such as a functional gene expression module causing a biological response, a transcription regulatory network, or a constellation of mutations jointly causing a disease. In this paper we are specifically interested in identifying modules that control a particular outcome variable, such as a disease biomarker. We discuss the statistical properties that functional networks should possess and introduce the concept of network consistency, which real functional networks of cooperating genes should satisfy, and we use this concept directly in the pathway discovery method we present. Our method gives superior performance for all but the simplest functional networks.
Statistical Applications in Genetics and Molecular Biology, Volume 15, Issue 1, pp. 1-18.
Citations: 6
MDI-GPU: accelerating integrative modelling for genomic-scale data using GP-GPU computing.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2016-03-01 DOI: 10.1515/sagmb-2015-0055
Samuel A Mason, Faiz Sayyid, Paul D W Kirk, Colin Starr, David L Wild

The integration of multi-dimensional datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct but often complementary information. However, the large amount of data adds a burden to any inference task. Flexible Bayesian methods may reduce the need for strong modelling assumptions, but can also increase the computational burden. We present an improved implementation of a Bayesian correlated clustering algorithm that permits integrated clustering to be routinely performed across multiple datasets, each with tens of thousands of items. By exploiting GPU-based computation, we are able to improve the runtime performance of the algorithm by almost four orders of magnitude. This permits analysis across genomic-scale data sets, greatly expanding the range of applications over those originally possible. MDI is available here: http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/.
Statistical Applications in Genetics and Molecular Biology, Volume 15, Issue 1, pp. 83-6.
Citations: 0
HMM-DM: identifying differentially methylated regions using a hidden Markov model
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2016-03-01 DOI: 10.1515/sagmb-2015-0077
Xiaoqing Yu, Shuying Sun
Abstract DNA methylation is an epigenetic modification involved in organism development and cellular differentiation. Identifying differential methylation can help to study genomic regions associated with diseases. Differential methylation studies at single-CG resolution have become possible with bisulfite sequencing (BS) technology. However, there is still a lack of efficient statistical methods for identifying differentially methylated (DM) regions in BS data. We have developed a new approach named HMM-DM to detect DM regions between two biological conditions using BS data. This new approach first uses a hidden Markov model (HMM) to identify DM CG sites, accounting for spatial correlation across CG sites and variation across samples, and then summarizes the identified sites into regions. We demonstrate through a simulation study that our approach performs better than BSmooth. We also illustrate the application of HMM-DM using a real breast cancer dataset.
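The second step described in this abstract, summarizing identified DM sites into regions, can be illustrated with a minimal Python sketch. It is not the HMM-DM implementation: the function name `merge_sites_into_regions` and the parameters `max_gap` and `min_sites` are hypothetical, and the merging rule (consecutive DM sites no more than `max_gap` bases apart form one region) is an assumed simplification. The per-site DM calls are taken as given, so the HMM step itself is not shown.

```python
from typing import List, Sequence, Tuple


def merge_sites_into_regions(
    positions: Sequence[int],
    is_dm: Sequence[bool],
    max_gap: int = 100,   # hypothetical: largest distance allowed between neighbouring DM sites
    min_sites: int = 2,   # hypothetical: smallest number of DM sites that counts as a region
) -> List[Tuple[int, int, int]]:
    """Group consecutive DM-flagged CG sites into (start, end, n_sites) regions.

    `positions` must be sorted in ascending order along one chromosome.
    A non-DM site, or a gap larger than `max_gap`, ends the current region.
    """
    regions: List[Tuple[int, int, int]] = []
    current: List[int] = []  # positions of the DM sites in the region being built

    for pos, flag in zip(positions, is_dm):
        if flag and current and pos - current[-1] <= max_gap:
            # Still inside the same region: extend it.
            current.append(pos)
            continue
        # Otherwise the running region (if any) ends here.
        if len(current) >= min_sites:
            regions.append((current[0], current[-1], len(current)))
        current = [pos] if flag else []

    # Flush the last region.
    if len(current) >= min_sites:
        regions.append((current[0], current[-1], len(current)))
    return regions


example_positions = [100, 150, 180, 400, 420, 900]
example_calls = [True, True, True, False, True, True]
example_regions = merge_sites_into_regions(example_positions, example_calls)
print(example_regions)
```

In this toy input the first three sites form one region, while the DM sites at 420 and 900 are too far apart to be merged and are dropped as singletons.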
Statistical Applications in Genetics and Molecular Biology, Volume 15, Issue 1, pp. 69-81.
Citations: 30