Statistical Applications in Genetics and Molecular Biology: latest articles, page 9

Non-parametric estimation of population size changes from the site frequency spectrum

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-04-07 DOI: 10.1101/125351

B. L. Waltoft, A. Hobolth

Abstract Changes in population size are a useful quantity for understanding the evolutionary history of a species. Genetic variation within a species can be summarized by the site frequency spectrum (SFS). For a sample of size n, the SFS is a vector of length n − 1 where entry i is the number of sites where the mutant base appears i times and the ancestral base appears n − i times. We present a new method, CubSFS, for estimating the changes in population size of a panmictic population from an observed SFS. First, we provide a straightforward proof for the expression of the expected site frequency spectrum depending only on the population size. Our derivation is based on an eigenvalue decomposition of the instantaneous coalescent rate matrix. Second, we solve the inverse problem of determining the changes in population size from an observed SFS. Our solution is based on a cubic spline for the population size. The cubic spline is determined by minimizing the weighted average of two terms, namely (i) the goodness of fit to the observed SFS, and (ii) a penalty term based on the smoothness of the changes. The weight is determined by cross-validation. The new method is validated on simulated demographic histories and applied to unfolded and folded SFS from 26 different human populations from the 1000 Genomes Project.
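The SFS definition in the abstract can be made concrete with a small sketch. The code below is not CubSFS; it only counts an unfolded SFS from a toy 0/1 haplotype matrix (0 = ancestral, 1 = mutant) and folds it. The function names and the example matrix are hypothetical, chosen for illustration.

```python
def site_frequency_spectrum(haplotypes):
    """Unfolded SFS of length n - 1: entry i (1-based) counts the sites
    where the mutant base appears i times among the n sampled sequences."""
    n = len(haplotypes)
    num_sites = len(haplotypes[0])
    sfs = [0] * (n - 1)
    for site in range(num_sites):
        mutant_count = sum(h[site] for h in haplotypes)
        # Sites with 0 or n mutant copies are not segregating and are skipped.
        if 0 < mutant_count < n:
            sfs[mutant_count - 1] += 1
    return sfs


def fold(sfs):
    """Folded SFS: classes i and n - i are merged, as needed when the
    ancestral base is unknown."""
    n = len(sfs) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        j = n - i
        folded.append(sfs[i - 1] if i == j else sfs[i - 1] + sfs[j - 1])
    return folded


# Hypothetical sample: n = 4 sequences (rows), 5 sites (columns).
haplotypes = [
    [0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]

unfolded = site_frequency_spectrum(haplotypes)
print(unfolded)        # [2, 1, 0]
print(fold(unfolded))  # [2, 1]
```

Here two sites carry the mutant base once, one site carries it twice, and the remaining two sites are monomorphic in the sample, so they do not enter the spectrum.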



Citations: 7

Polyunphased: an extension to polytomous outcomes of the Unphased package for family-based genetic association analysis.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-03-01 DOI: 10.1515/sagmb-2016-0035

Alexandre Bureau, Jordie Croteau

Polytomous phenotypes arise when a disease has multiple subtypes or when two dichotomous phenotypes are analyzed simultaneously. Few software programs offer the option to analyze such phenotypes in family studies, and none implements conditional polytomous logistic regression for within-family analysis robust to population stratification. We introduce Polyunphased, an extension to polytomous phenotypes of the Unphased package, a flexible software tool for genetic association analysis in nuclear families. Like Unphased, Polyunphased is written in C++ and runs from the command line or from a Java graphical user interface. Most Unphased options remain available in Polyunphased, including those handling missing parental genotypes while preserving robustness to population stratification, and the modelling options. Simulation studies confirmed the expected statistical behaviour of the maximum likelihood estimates of the association parameters of the conditional logistic regression model when the corresponding association parameters in the parental term of the likelihood function are set to 0, but revealed convergence problems when estimating these parental association parameters separately. The former approach is thus recommended with polytomous phenotypes.



Citations: 0

Generalized partial linear varying multi-index coefficient model for gene-environment interactions

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-03-01 DOI: 10.1515/sagmb-2016-0045

Xu Liu, Bin Gao, Yuehua Cui

Abstract Epidemiological studies have suggested the joint effect of simultaneous exposures to multiple environments on disease risk. However, how environmental mixtures as a whole jointly modify genetic effect on disease risk is still largely unknown. Given the importance of gene-environment (G×E) interactions in many complex diseases, rigorously assessing the interaction effect between genes and environmental mixtures as a whole could offer novel insights into the etiology of complex diseases. For this purpose, we propose a generalized partial linear varying multi-index coefficient model (GPLVMICM) to capture the genetic effect on disease risk modulated by multiple environments as a whole. GPLVMICM is semiparametric in nature, which allows different index loading parameters in different index functions. We estimate the parametric parameters by a profile procedure, and the nonparametric index functions by a B-spline backfitted kernel method. Under some regularity conditions, the proposed parametric and nonparametric estimators are shown to be consistent and asymptotically normal. We propose a generalized likelihood ratio (GLR) test to rigorously assess the linearity of the interaction effect between multiple environments and a gene, while applying a parametric likelihood test to detect a linear G×E interaction effect. The finite sample performance of the proposed method is examined through simulation studies and is further illustrated through a real data analysis.



Citations: 2

Statistical models and computational algorithms for discovering relationships in microbiome data

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-01-01 DOI: 10.1515/sagmb-2015-0096

M. Shaikh, J. Beyene

Abstract Microbiomes, populations of microscopic organisms, have been found to be related to human health, and it is expected that further investigations will lead to novel perspectives on disease. The data used to analyze microbiomes are among the newest types (the result of high-throughput technology), and the means of analyzing these data are still rapidly evolving. One of the distributions that have been introduced into the microbiome literature, the Dirichlet-Multinomial, has received considerable attention. We extend this distribution’s use to uncover compositional relationships between organisms at a taxonomic level. We apply our new method to two real microbiome data sets: one from human nasal passages and another from human stool samples.
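For readers unfamiliar with the Dirichlet-Multinomial distribution named in the abstract, the following minimal sketch shows how counts are drawn from it: taxon proportions are first sampled from a Dirichlet distribution, then reads are allocated multinomially. This illustrates only the base distribution, not the authors' extension; the function name, the concentration parameters in `alpha` and the read depth are assumptions made for the example.

```python
import random


def sample_dirichlet_multinomial(alpha, total_reads, rng):
    """Draw one vector of taxon counts from a Dirichlet-Multinomial."""
    # Step 1: proportions ~ Dirichlet(alpha), obtained by normalising
    # independent Gamma(alpha_k, 1) draws.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    proportions = [g / total for g in gammas]
    # Step 2: counts ~ Multinomial(total_reads, proportions), obtained by
    # assigning each read to a taxon with the sampled probabilities.
    counts = [0] * len(alpha)
    for taxon in rng.choices(range(len(alpha)), weights=proportions, k=total_reads):
        counts[taxon] += 1
    return counts


rng = random.Random(1)            # fixed seed so the run is repeatable
alpha = [2.0, 5.0, 0.5, 1.0]      # hypothetical concentration parameters, 4 taxa
samples = [sample_dirichlet_multinomial(alpha, 1000, rng) for _ in range(5)]
for row in samples:
    print(row)
```

Because the proportions are redrawn for every sample, the counts vary more between samples than a plain multinomial with fixed proportions would allow, which is the overdispersion that makes this distribution attractive for microbiome counts.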


Citations: 0

Sample size calculation based on generalized linear models for differential expression analysis in RNA-seq data

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2016-12-01 DOI: 10.1515/sagmb-2016-0008

Chung-I Li, Y. Shyr

Abstract As RNA-seq rapidly develops and costs continually decrease, the quantity and frequency of samples being sequenced will grow exponentially. With proteomic investigations becoming more multivariate and quantitative, determining a study’s optimal sample size is now a vital step in experimental design. Current methods for calculating a study’s required sample size are mostly based on the hypothesis testing framework, which assumes each gene count can be modeled through Poisson or negative binomial distributions; however, these methods are limited when it comes to accommodating covariates. To address this limitation, we propose an estimating procedure based on the generalized linear model. This easy-to-use method constructs a representative exemplary dataset and estimates the conditional power, all without requiring complicated mathematical approximations or formulas. Even more attractive, the downstream analysis can be performed with current R/Bioconductor packages. To demonstrate the practicability and efficiency of this method, we apply it to three real-world studies, and introduce our on-line calculator developed to determine the optimal sample size for an RNA-seq study.



Citations: 7

A simulation framework for correlated count data of features subsets in high-throughput sequencing or proteomics experiments

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2016-10-01 DOI: 10.1515/sagmb-2015-0082

Jochen Kruppa, F. Kramer, T. Beissbarth, K. Jung

Abstract As part of the data processing of high-throughput sequencing experiments, count data are produced that represent the number of reads that map to specific genomic regions. Count data also arise in mass spectrometric experiments for the detection of protein-protein interactions. For evaluating new computational methods for the analysis of sequencing count data or spectral count data from proteomics experiments, artificial count data are thus required. Although some methods for the generation of artificial sequencing count data have been proposed, all of them simulate single sequencing runs, thus omitting the correlation structure between the individual genomic features, or they are limited to specific structures. We propose to draw correlated data from the multivariate normal distribution and round these continuous data in order to obtain discrete counts. In our approach, the required distribution parameters can either be constructed in different ways or estimated from real count data. Because rounding affects the correlation structure, we evaluate the use of shrinkage estimators that have already been used in the context of artificial expression data from DNA microarrays. Our approach turned out to be useful for the simulation of counts for defined subsets of features such as individual pathways or GO categories.
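The core idea of the abstract, drawing correlated values from a multivariate normal distribution and rounding them to counts, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the mean vector and covariance matrix are hypothetical, negative draws are truncated at zero (a choice made here, not stated in the abstract), and the shrinkage estimation step is not shown.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def simulate_counts(mean, cov, n_samples, rng):
    """Draw multivariate normal vectors and round them to non-negative counts."""
    low = cholesky(cov)
    p = len(mean)
    samples = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        x = [mean[i] + sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(p)]
        # Rounding yields integers; truncation at zero is an assumption of this sketch.
        samples.append([max(0, round(v)) for v in x])
    return samples


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


mean = [50.0, 80.0]                    # hypothetical mean counts for two features
cov = [[100.0, 80.0], [80.0, 100.0]]   # hypothetical covariance, correlation 0.8
counts = simulate_counts(mean, cov, 2000, random.Random(42))
print(counts[:3])
print(pearson([r[0] for r in counts], [r[1] for r in counts]))
```

The printed correlation of the rounded counts can be compared with the target value of 0.8 to see how much the rounding step distorts the correlation structure, which is the effect the abstract refers to.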



Citations: 3

Accounting for isotopic clustering in Fourier transform mass spectrometry data analysis for clinical diagnostic studies

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2016-10-01 DOI: 10.1515/sagmb-2016-0005

A. Kakourou, W. Vach, S. Nicolardi, Y. V. D. van der Burgt, B. Mertens

Abstract Mass spectrometry-based clinical proteomics has emerged as a powerful tool for high-throughput protein profiling and biomarker discovery. Recent improvements in mass spectrometry technology have boosted the potential of proteomic studies in biomedical research. However, the complexity of the proteomic expression introduces new statistical challenges in summarizing and analyzing the acquired data. Statistical methods for optimally processing proteomic data are currently a growing field of research. In this paper we present simple, yet appropriate methods to preprocess, summarize and analyze high-throughput MALDI-FTICR mass spectrometry data, collected in a case-control fashion, while dealing with the statistical challenges that accompany such data. The known statistical properties of the isotopic distribution of the peptide molecules are used to preprocess the spectra and translate the proteomic expression into a condensed data set. Information on either the intensity level or the shape of the identified isotopic clusters is used to derive summary measures on which diagnostic rules for disease status allocation will be based. Results indicate that both the shape of the identified isotopic clusters and the overall intensity level carry information on the class outcome and can be used to predict the presence or absence of the disease.



Citations: 4

Model selection for factorial Gaussian graphical models with an application to dynamic regulatory networks

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2016-06-01 DOI: 10.1515/sagmb-2014-0075

V. Vinciotti, L. Augugliaro, A. Abbruzzo, E. Wit

Abstract Factorial Gaussian graphical models (fGGMs) have recently been proposed for inferring dynamic gene regulatory networks from genomic high-throughput data. In the search for true regulatory relationships amongst the vast space of possible networks, these models allow the imposition of certain restrictions on the dynamic nature of these relationships, such as Markov dependencies of low order – some entries of the precision matrix are a priori zeros – or equal dependency strengths across time lags – some entries of the precision matrix are assumed to be equal. The precision matrix is then estimated by l1-penalized maximum likelihood, imposing a further constraint on the absolute value of its entries, which results in sparse networks. Selecting the optimal sparsity level is a major challenge for this type of approach. In this paper, we evaluate the performance of a number of model selection criteria for fGGMs by means of two simulated regulatory networks from realistic biological processes. The analysis reveals a good performance of fGGMs in comparison with other methods for inferring dynamic networks, and of the KLCV criterion in particular for model selection. Finally, we present an application to high-resolution time-course microarray data from the Neisseria meningitidis bacterium, a causative agent of life-threatening infections such as meningitis. The methodology described in this paper is implemented in the R package sglasso, freely available at CRAN, http://CRAN.R-project.org/package=sglasso.



Citations: 19

Testing differentially expressed genes in dose-response studies and with ordinal phenotypes

IF 0.9 Zone 4 (Mathematics) Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2016-06-01 DOI: 10.1515/sagmb-2015-0091

E. Sweeney, C. Crainiceanu, J. Gertheiss

Abstract When testing for differentially expressed genes between more than two groups, the groups are often defined by dose levels in dose-response experiments or by ordinal phenotypes, such as disease stages. We discuss the potential of a new approach that uses the levels’ ordering without making any structural assumptions, such as monotonicity, by testing for zero variance components in a mixed-model framework. Since the mixed-effects model approach borrows strength across doses/levels, the proposed test can also be applied when the number of dose levels/phenotypes is large and/or the number of subjects per group is small. We illustrate the new test in simulation studies and on several publicly available datasets and compare it to alternative testing procedures. All tests considered are implemented in R and are publicly available. The new approach offers a very fast and powerful way to test for differentially expressed genes between ordered groups without making restrictive assumptions about the true relationship between factor levels and response.
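The idea of testing for a zero variance component across ordered levels can be illustrated with a small sketch. It is not the authors' R implementation, and it makes several assumptions of its own: level effects follow a random walk over the ordered levels with increment variance tau^2, the null hypothesis tau^2 = 0 is assessed with a score-type statistic, and the null distribution is approximated by permuting the responses. The function names `score_statistic` and `permutation_test` and the simulated data are hypothetical.

```python
import random


def score_statistic(y, levels, n_levels):
    """Score-type statistic for H0: tau^2 = 0 (no differences between levels).

    Assumption: levels are coded 0..n_levels-1 and level effects are cumulative
    sums of random increments, so the random-effects design has one 0/1 column
    per step j, equal to 1 for observations with level >= j.
    """
    mean = sum(y) / len(y)
    resid = [v - mean for v in y]          # residuals under the intercept-only null model
    total_ss = sum(r * r for r in resid)
    stat = 0.0
    for j in range(1, n_levels):
        step_sum = sum(r for r, lev in zip(resid, levels) if lev >= j)
        stat += step_sum ** 2              # squared projection onto step column j
    return stat / total_ss


def permutation_test(y, levels, n_levels, n_perm=999, seed=0):
    """Permutation p-value for the score-type statistic (an assumed null approximation)."""
    rng = random.Random(seed)
    observed = score_statistic(y, levels, n_levels)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)              # break any link between response and level
        if score_statistic(shuffled, levels, n_levels) >= observed:
            hits += 1
    return observed, (1 + hits) / (1 + n_perm)


# Simulated expression values for one gene: 5 ordered dose levels, 6 subjects each.
data_rng = random.Random(1)
levels = [k for k in range(5) for _ in range(6)]
y = [0.8 * k + data_rng.gauss(0, 0.5) for k in levels]

stat, p_value = permutation_test(y, levels, n_levels=5)
print(f"statistic = {stat:.2f}, permutation p = {p_value:.3f}")
```

The cumulative-step design is what lets the statistic use the ordering of the levels without requiring a monotone dose-response curve: any pattern of level differences contributes, but patterns that persist across neighbouring levels contribute most.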



Citations: 4

Differential methylation tests of regulatory regions

IF 0.9 Zone 4 (Mathematics) Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2016-06-01 DOI: 10.1515/sagmb-2015-0037

D. Ryu, Hongyang Xu, Varghese George, S. Su, Xiaoling Wang, Huidong Shi, R. Podolsky

Abstract Differential methylation of regulatory elements is critical in epigenetic research and can be statistically tested. We developed a new statistical test, the generalized integrated functional test (GIFT), that tests for regional differences in methylation based on the methylation percent at each CpG site within a genomic region. The GIFT uses estimated subject-specific profiles obtained with smoothing methods, specifically wavelet smoothing, and calculates an ANOVA-like test to compare the average profiles of the groups. In this way, possibly correlated CpG sites within the regulatory region are compared jointly. Simulations and analyses of data obtained from patients with chronic lymphocytic leukemia indicate that GIFT has good statistical properties and is able to identify promising genomic regions. Further, GIFT is likely to work with multiple different types of experiments, since different smoothing methods can be used to estimate noise-free profiles of the data. Matlab code for GIFT and sample data are available at http://www.augusta.edu/mcg/biostatepi/people/software/gift.html.
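The two steps described in the abstract, smoothing each subject's profile and then comparing group mean profiles with an ANOVA-like statistic, can be sketched as follows. This is not the GIFT Matlab code. As labelled assumptions, a moving average stands in for wavelet smoothing, the statistic is a simple ratio of between-group to within-group sums of squares summed over CpG sites, and the p-value comes from permuting group labels. All function names and the simulated data are hypothetical.

```python
import random


def smooth(profile, half_window=1):
    """Moving-average smoother (an assumed stand-in for wavelet smoothing)."""
    n = len(profile)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        out.append(sum(profile[lo:hi]) / (hi - lo))
    return out


def f_type_statistic(profiles, groups):
    """Between-group over within-group sums of squares, pooled over all sites."""
    labels = sorted(set(groups))
    between = within = 0.0
    for s in range(len(profiles[0])):
        values = [p[s] for p in profiles]
        grand = sum(values) / len(values)
        for g in labels:
            vg = [v for v, lab in zip(values, groups) if lab == g]
            mg = sum(vg) / len(vg)
            between += len(vg) * (mg - grand) ** 2
            within += sum((v - mg) ** 2 for v in vg)
    return between / within


def regional_test(raw_profiles, groups, n_perm=999, seed=0):
    """Smooth each subject's profile, then test for a regional group difference."""
    rng = random.Random(seed)
    profiles = [smooth(p) for p in raw_profiles]
    observed = f_type_statistic(profiles, groups)
    perm = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)                  # permute group labels across subjects
        if f_type_statistic(profiles, perm) >= observed:
            hits += 1
    return observed, (1 + hits) / (1 + n_perm)


# Simulated methylation percentages: 10 CpG sites, 8 subjects per group,
# with the second group shifted upward at sites 3 to 6.
data_rng = random.Random(2)


def simulate(shift):
    return [50 + (shift if 3 <= s <= 6 else 0) + data_rng.gauss(0, 5) for s in range(10)]


raw = [simulate(0) for _ in range(8)] + [simulate(20) for _ in range(8)]
groups = ["control"] * 8 + ["case"] * 8

stat, p_value = regional_test(raw, groups)
print(f"statistic = {stat:.2f}, permutation p = {p_value:.3f}")
```

Because whole subjects are permuted rather than individual sites, correlation between neighbouring CpG sites is preserved under the null, which is the reason for treating the region as a single unit.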



Citations: 6