Statistical Applications in Genetics and Molecular Biology: Latest Articles

Generalized partial linear varying multi-index coefficient model for gene-environment interactions
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2017-03-01, DOI: 10.1515/sagmb-2016-0045
Xu Liu, Bin Gao, Yuehua Cui
Abstract: Epidemiological studies have suggested that simultaneous exposure to multiple environments has a joint effect on disease risk. However, how environmental mixtures as a whole jointly modify genetic effects on disease risk is still largely unknown. Given the importance of gene-environment (G×E) interactions in many complex diseases, rigorously assessing the interaction effect between genes and environmental mixtures as a whole could offer novel insights into the etiology of complex diseases. For this purpose, we propose a generalized partial linear varying multi-index coefficient model (GPLVMICM) to capture the genetic effect on disease risk modulated by multiple environments as a whole. GPLVMICM is semiparametric in nature, which allows different index loading parameters in different index functions. We estimate the parametric components by a profile procedure, and the nonparametric index functions by a B-spline backfitted kernel method. Under some regularity conditions, the proposed parametric and nonparametric estimators are shown to be consistent and asymptotically normal. We propose a generalized likelihood ratio (GLR) test to rigorously assess the linearity of the interaction effect between multiple environments and a gene, and apply a parametric likelihood test to detect a linear G×E interaction effect. The finite sample performance of the proposed method is examined through simulation studies and is further illustrated through a real data analysis.
Citations: 2
Statistical models and computational algorithms for discovering relationships in microbiome data
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2017-01-01, DOI: 10.1515/sagmb-2015-0096
M. Shaikh, J. Beyene
Abstract: Microbiomes, populations of microscopic organisms, have been found to be related to human health, and it is expected that further investigations will lead to novel perspectives on disease. The data used to analyze microbiomes are among the newest types (the result of high-throughput technology), and the means of analyzing these data are still rapidly evolving. One of the distributions that has been introduced into the microbiome literature, the Dirichlet-Multinomial, has received considerable attention. We extend this distribution’s use to uncover compositional relationships between organisms at a taxonomic level. We apply our new method to two real microbiome data sets: one from human nasal passages and another from human stool samples.
Citations: 0
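Note on the Dirichlet-Multinomial named in the abstract above: the distribution can be read as a two-stage draw, first taxon proportions from a Dirichlet, then read counts from a multinomial given those proportions. The following is a minimal sketch of that sampling scheme only, not the authors' extension; the function name, the concentration values in `alpha`, and the read depth of 1000 are illustrative assumptions.

```python
import random


def sample_dirichlet_multinomial(alpha, n_reads, rng):
    """Draw one vector of taxon counts from a Dirichlet-Multinomial.

    alpha   : positive concentration parameters, one per taxon (illustrative values)
    n_reads : total number of reads in the sample
    rng     : a random.Random instance, so the draw is reproducible
    """
    # Stage 1: Dirichlet proportions, obtained by normalizing Gamma draws.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    props = [g / total for g in gammas]

    # Stage 2: multinomial counts given those proportions.
    counts = [0] * len(alpha)
    for taxon in rng.choices(range(len(alpha)), weights=props, k=n_reads):
        counts[taxon] += 1
    return counts


rng = random.Random(0)
alpha = [2.0, 5.0, 0.5, 1.0]  # hypothetical concentrations for four taxa
samples = [sample_dirichlet_multinomial(alpha, 1000, rng) for _ in range(5)]
for s in samples:
    print(s)
```

Because the proportions are redrawn for every sample, the counts vary more between samples than a plain multinomial with fixed proportions would allow, which is the overdispersion that makes this distribution attractive for microbiome counts.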
Sample size calculation based on generalized linear models for differential expression analysis in RNA-seq data
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2016-12-01, DOI: 10.1515/sagmb-2016-0008
Chung-I Li, Y. Shyr
Abstract: As RNA-seq rapidly develops and costs continually decrease, the quantity and frequency of samples being sequenced will grow exponentially. With proteomic investigations becoming more multivariate and quantitative, determining a study’s optimal sample size is now a vital step in experimental design. Current methods for calculating a study’s required sample size are mostly based on the hypothesis testing framework, which assumes each gene count can be modeled through Poisson or negative binomial distributions; however, these methods are limited when it comes to accommodating covariates. To address this limitation, we propose an estimating procedure based on the generalized linear model. This easy-to-use method constructs a representative exemplary dataset and estimates the conditional power, all without requiring complicated mathematical approximations or formulas. Even more attractive, the downstream analysis can be performed with current R/Bioconductor packages. To demonstrate the practicability and efficiency of this method, we apply it to three real-world studies, and introduce our on-line calculator developed to determine the optimal sample size for an RNA-seq study.
Citations: 7
A simulation framework for correlated count data of features subsets in high-throughput sequencing or proteomics experiments
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2016-10-01, DOI: 10.1515/sagmb-2015-0082
Jochen Kruppa, F. Kramer, T. Beissbarth, K. Jung
Abstract: As part of the data processing of high-throughput sequencing experiments, count data are produced representing the number of reads that map to specific genomic regions. Count data also arise in mass spectrometric experiments for the detection of protein-protein interactions. For evaluating new computational methods for the analysis of sequencing count data or spectral count data from proteomics experiments, artificial count data are thus required. Although some methods for the generation of artificial sequencing count data have been proposed, all of them either simulate single sequencing runs, thus omitting the correlation structure between the individual genomic features, or are limited to specific structures. We propose to draw correlated data from the multivariate normal distribution and round these continuous data in order to obtain discrete counts. In our approach, the required distribution parameters can either be constructed in different ways or estimated from real count data. Because rounding affects the correlation structure, we evaluate the use of shrinkage estimators that have already been used in the context of artificial expression data from DNA microarrays. Our approach turned out to be useful for the simulation of counts for defined subsets of features such as individual pathways or GO categories.
Citations: 3
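Note on the simulation idea in the abstract above (draw from a multivariate normal, then round to counts): the sketch below illustrates that core step under stated assumptions. The mean vector and covariance matrix are made-up values for three features, negative draws are clipped to zero (an assumption of this sketch, not something the abstract states), and the shrinkage estimation the authors evaluate is not included.

```python
import math
import random


def cholesky(cov):
    """Return lower-triangular L with L L^T = cov (cov must be positive definite)."""
    p = len(cov)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def simulate_counts(mean, cov, n_samples, rng):
    """Draw from N(mean, cov), round to integers, clip negatives to zero."""
    L = cholesky(cov)
    p = len(mean)
    data = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        x = [mean[i] + sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(p)]
        data.append([max(0, round(v)) for v in x])
    return data


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


# Hypothetical parameters: features 0 and 1 correlate at 0.8, feature 2 is independent.
mean = [50.0, 80.0, 30.0]
cov = [[100.0, 120.0, 0.0],
       [120.0, 225.0, 0.0],
       [0.0, 0.0, 25.0]]

rng = random.Random(42)
counts = simulate_counts(mean, cov, 2000, rng)
col0, col1, col2 = ([row[j] for row in counts] for j in range(3))
print("corr(0,1) =", pearson(col0, col1))
print("corr(0,2) =", pearson(col0, col2))
```

With variances this large relative to the rounding step, the correlation of the rounded counts stays close to the target; for low-count features the distortion from rounding grows, which is the situation the abstract's remark about rounding and shrinkage estimators addresses.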
Accounting for isotopic clustering in Fourier transform mass spectrometry data analysis for clinical diagnostic studies
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2016-10-01, DOI: 10.1515/sagmb-2016-0005
A. Kakourou, W. Vach, S. Nicolardi, Y. V. D. van der Burgt, B. Mertens
Abstract: Mass spectrometry-based clinical proteomics has emerged as a powerful tool for high-throughput protein profiling and biomarker discovery. Recent improvements in mass spectrometry technology have boosted the potential of proteomic studies in biomedical research. However, the complexity of the proteomic expression introduces new statistical challenges in summarizing and analyzing the acquired data. Statistical methods for optimally processing proteomic data are currently a growing field of research. In this paper we present simple yet appropriate methods to preprocess, summarize and analyze high-throughput MALDI-FTICR mass spectrometry data, collected in a case-control fashion, while dealing with the statistical challenges that accompany such data. The known statistical properties of the isotopic distribution of the peptide molecules are used to preprocess the spectra and translate the proteomic expression into a condensed data set. Information on either the intensity level or the shape of the identified isotopic clusters is used to derive summary measures on which diagnostic rules for disease status allocation will be based. Results indicate that both the shape of the identified isotopic clusters and the overall intensity level carry information on the class outcome and can be used to predict the presence or absence of the disease.
Citations: 4
Model selection for factorial Gaussian graphical models with an application to dynamic regulatory networks
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2016-06-01, DOI: 10.1515/sagmb-2014-0075
V. Vinciotti, L. Augugliaro, A. Abbruzzo, E. Wit
Abstract: Factorial Gaussian graphical models (fGGMs) have recently been proposed for inferring dynamic gene regulatory networks from genomic high-throughput data. In the search for true regulatory relationships amongst the vast space of possible networks, these models allow the imposition of certain restrictions on the dynamic nature of these relationships, such as Markov dependencies of low order (some entries of the precision matrix are a priori zeros) or equal dependency strengths across time lags (some entries of the precision matrix are assumed to be equal). The precision matrix is then estimated by l1-penalized maximum likelihood, imposing a further constraint on the absolute value of its entries, which results in sparse networks. Selecting the optimal sparsity level is a major challenge for this type of approach. In this paper, we evaluate the performance of a number of model selection criteria for fGGMs by means of two simulated regulatory networks from realistic biological processes. The analysis reveals a good performance of fGGMs in comparison with other methods for inferring dynamic networks, and of the KLCV criterion in particular for model selection. Finally, we present an application to high-resolution time-course microarray data from the Neisseria meningitidis bacterium, a causative agent of life-threatening infections such as meningitis. The methodology described in this paper is implemented in the R package sglasso, freely available at CRAN, http://CRAN.R-project.org/package=sglasso.
Citations: 19
Testing differentially expressed genes in dose-response studies and with ordinal phenotypes
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2016-06-01, DOI: 10.1515/sagmb-2015-0091
E. Sweeney, C. Crainiceanu, J. Gertheiss
Abstract: When testing for differentially expressed genes between more than two groups, the groups are often defined by dose levels in dose-response experiments or ordinal phenotypes, such as disease stages. We discuss the potential of a new approach that uses the levels’ ordering without making any structural assumptions, such as monotonicity, by testing for zero variance components in a mixed model framework. Since the mixed effects model approach borrows strength across doses/levels, the proposed test can also be applied when the number of dose levels/phenotypes is large and/or the number of subjects per group is small. We illustrate the new test in simulation studies and on several publicly available datasets and compare it to alternative testing procedures. All tests considered are implemented in R and are publicly available. The new approach offers a very fast and powerful way to test for differentially expressed genes between ordered groups without making restrictive assumptions with respect to the true relationship between factor levels and response.
Citations: 4
Differential methylation tests of regulatory regions
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2016-06-01, DOI: 10.1515/sagmb-2015-0037
D. Ryu, Hongyang Xu, Varghese George, S. Su, Xiaoling Wang, Huidong Shi, R. Podolsky
Abstract: Differential methylation of regulatory elements is critical in epigenetic research and can be statistically tested. We developed a new statistical test, the generalized integrated functional test (GIFT), that tests for regional differences in methylation based on the methylation percent at each CpG site within a genomic region. The GIFT uses subject-specific profiles estimated with smoothing methods, specifically wavelet smoothing, and calculates an ANOVA-like test to compare the average profiles of groups. In this way, possibly correlated CpG sites within the regulatory region are compared all together. Simulations and analyses of data obtained from patients with chronic lymphocytic leukemia indicate that GIFT has good statistical properties and is able to identify promising genomic regions. Further, GIFT is likely to work with multiple different types of experiments since different smoothing methods can be used to estimate the profiles of data without noise. Matlab code for GIFT and sample data are available at http://www.augusta.edu/mcg/biostatepi/people/software/gift.html.
Citations: 6
Stability selection for lasso, ridge and elastic net implemented with AFT models
IF 0.9 Zone 4 Mathematics Q3 Mathematics Pub Date: 2016-04-25 DOI: 10.1515/sagmb-2017-0001
M. H. R. Khan, Anamika Bhadra, Tamanna Howlader
Abstract Instability in model selection is a major concern for data sets containing a large number of covariates. We focus on stability selection, a technique that improves variable selection performance for a range of selection methods by aggregating the results of applying a selection procedure to sub-samples of the data, here in the setting where the observations are subject to right censoring. Accelerated failure time (AFT) models have proved useful in many contexts, including heavy censoring (for example, in cancer survival) and high dimensionality (for example, in microarray data). We implement the stability selection approach using three variable selection techniques (lasso, ridge regression, and elastic net) applied to censored data using AFT models. We compare the performance of these regularized techniques with and without stability selection through simulation studies and two real data examples: a breast cancer data set and a diffuse large B-cell lymphoma data set. The results suggest that stability selection consistently yields a stable set of selected variables and that, as the dimension of the data increases, the performance of methods with stability selection improves relative to methods without it, irrespective of the collinearity between the covariates.
Citations: 19
A Markov random field-based approach for joint estimation of differentially expressed genes in mouse transcriptome data.
IF 0.9 Zone 4 Mathematics Q3 Mathematics Pub Date: 2016-04-01 DOI: 10.1515/sagmb-2015-0070
Zhixiang Lin, Mingfeng Li, Nenad Sestan, Hongyu Zhao

The statistical methodology developed in this study was motivated by our interest in studying neurodevelopment using the mouse brain RNA-Seq data set, where gene expression levels were measured in multiple layers of the somatosensory cortex across time in both female and male samples. We aim to identify differentially expressed genes between adjacent time points, which may provide insights into the dynamics of brain development. Because of the extremely small sample size (one male and one female at each time point), simple marginal analysis may be underpowered. We propose a Markov random field (MRF)-based approach that capitalizes on the similarity between layers, the temporal dependency, and the similarity between sexes. The model parameters are estimated by an efficient EM algorithm with a mean field-like approximation. Simulation results and real data analysis suggest that the proposed model has greater power to detect differentially expressed genes than simple marginal analysis. Our method also reveals biologically interesting results in the mouse brain RNA-Seq data set.

Citations: 0