Statistical Applications in Genetics and Molecular Biology: latest articles

A statistical method for analysing cospeciation in tritrophic ecology using electrical circuit theory.

IF 0.9 | Zone 4, Mathematics | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-11-27 DOI: 10.1515/sagmb-2016-0049

Colleen Nooney, Stuart Barber, Arief Gusnanto, Walter R Gilks

We introduce a new method to test efficiently for cospeciation in tritrophic systems. Our method utilises an analogy with electrical circuit theory to reduce higher order systems into bitrophic data sets that retain the information of the original system. We use a sophisticated permutation scheme that weights interactions between two trophic layers based on their connection to the third layer in the system. Our method has several advantages compared to the method of Mramba et al. [Mramba, L. K., S. Barber, K. Hommola, L. A. Dyer, J. S. Wilson, M. L. Forister and W. R. Gilks (2013): "Permutation tests for analyzing cospeciation in multiple phylogenies: applications in tri-trophic ecology," Stat. Appl. Genet. Mol. Biol., 12, 679-701.]. We do not require triangular interactions to connect the three phylogenetic trees and an easily interpreted p-value is obtained in one step. Another advantage of our method is the scope for generalisation to higher order systems and phylogenetic networks. The performance of our method is compared to the methods of Hommola et al. [Hommola, K., J. E. Smith, Y. Qiu and W. R. Gilks (2009): "A permutation test of host-parasite cospeciation," Mol. Biol. Evol., 26, 1457-1468.] and Mramba et al. [Mramba, L. K., S. Barber, K. Hommola, L. A. Dyer, J. S. Wilson, M. L. Forister and W. R. Gilks (2013): "Permutation tests for analyzing cospeciation in multiple phylogenies: applications in tri-trophic ecology," Stat. Appl. Genet. Mol. Biol., 12, 679-701.] at the bitrophic and tritrophic level, respectively. This was achieved by evaluating type I error and statistical power. The results show that our method produces unbiased p-values and has comparable power overall at both trophic levels. Our method was successfully applied to a dataset of leaf-mining moths, parasitoid wasps and host plants [Lopez-Vaamonde, C., H. Godfray, S. West, C. Hansson and J. Cook (2005): "The evolution of host use and unusual reproductive strategies in Achrysocharoides parasitoid wasps," J. Evol. Biol., 18, 1029-1041.], at both the bitrophic and tritrophic levels.
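The bitrophic building block mentioned in this abstract, a permutation test of cospeciation, can be sketched as follows. This is a minimal illustration in the spirit of a distance-correlation permutation test, not the authors' circuit-theory method: the toy distance tables, the taxon names and the helper names (`tree_distances`, `link_correlation`, `permutation_p_value`) are all assumptions made for the example.

```python
import math
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def tree_distances(taxa, close_pairs):
    """Toy tip-to-tip distances: 0 to self, 2 for sister taxa, 4 otherwise."""
    return {
        a: {b: 0 if a == b else (2 if frozenset((a, b)) in close_pairs else 4)
            for b in taxa}
        for a in taxa
    }


def link_correlation(host_dist, para_dist, links):
    """Correlate host and parasite distances over all pairs of interaction links."""
    xs, ys = [], []
    for (h1, p1), (h2, p2) in combinations(links, 2):
        xs.append(host_dist[h1][h2])
        ys.append(para_dist[p1][p2])
    return pearson(xs, ys)


def permutation_p_value(host_dist, para_dist, links, n_perm=999, seed=0):
    """Shuffle the tip labels of both trees and count correlations >= observed."""
    rng = random.Random(seed)
    observed = link_correlation(host_dist, para_dist, links)
    hosts, paras = sorted(host_dist), sorted(para_dist)
    hits = 0
    for _ in range(n_perm):
        h_map = dict(zip(hosts, rng.sample(hosts, len(hosts))))
        p_map = dict(zip(paras, rng.sample(paras, len(paras))))
        shuffled = [(h_map[h], p_map[p]) for h, p in links]
        # Small tolerance so floating-point ties count as "at least as extreme".
        if link_correlation(host_dist, para_dist, shuffled) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical data: two four-taxon trees with identical shape and matched links.
hosts = ["H1", "H2", "H3", "H4"]
paras = ["P1", "P2", "P3", "P4"]
host_dist = tree_distances(hosts, {frozenset(("H1", "H2")), frozenset(("H3", "H4"))})
para_dist = tree_distances(paras, {frozenset(("P1", "P2")), frozenset(("P3", "P4"))})
links = list(zip(hosts, paras))
```

Calling `permutation_p_value(host_dist, para_dist, links)` returns the observed correlation and a permutation p-value. With only four taxa per tree the p-value cannot be small, so realistic use would need larger phylogenies.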

Volume 16, issue 5-6, pages 313-331.

Citations: 2

Bayesian estimation of differential transcript usage from RNA-seq data.

IF 0.9 | Zone 4, Mathematics | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-11-27 DOI: 10.1515/sagmb-2017-0005

Panagiotis Papastamoulis, Magnus Rattray

Next generation sequencing allows the identification of genes consisting of differentially expressed transcripts, a term which usually refers to changes in the overall expression level. A specific type of differential expression is differential transcript usage (DTU), which targets changes in the relative within-gene expression of a transcript. The contribution of this paper is to: (a) extend cjBitSeq, a previously introduced Bayesian model originally designed for identifying changes in overall expression levels, to the DTU context and (b) propose a Bayesian version of DRIMSeq, a frequentist model for inferring DTU. cjBitSeq is a read-based model and performs fully Bayesian inference by MCMC sampling on the space of latent states of each transcript per gene. BayesDRIMSeq is a count-based model and estimates the Bayes Factor of a DTU model against a null model using Laplace's approximation. The proposed models are benchmarked against the existing ones using a recent independent simulation study as well as a real RNA-seq dataset. Our results suggest that the Bayesian methods exhibit similar performance to DRIMSeq in terms of precision/recall but offer better calibration of False Discovery Rate.
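The distinction this abstract draws between overall expression and relative within-gene expression can be made concrete with a small sketch. The counts, transcript names and the function `transcript_usage` below are made up for illustration and are unrelated to the cjBitSeq or DRIMSeq software.

```python
def transcript_usage(counts):
    """Return each transcript's share of its gene's total count."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: value / total for name, value in counts.items()}


# Hypothetical counts for one gene with two transcripts in two conditions.
condition_a = {"tx1": 80, "tx2": 20}
condition_b = {"tx1": 200, "tx2": 200}

usage_a = transcript_usage(condition_a)
usage_b = transcript_usage(condition_b)

# Gene-level expression rises fourfold (100 -> 400 counts) ...
gene_fold_change = sum(condition_b.values()) / sum(condition_a.values())
# ... while the usage of tx1 falls from 0.8 to 0.5, which is the DTU signal.
usage_shift_tx1 = usage_b["tx1"] - usage_a["tx1"]
```

In this toy case both transcripts go up in absolute terms, yet their proportions within the gene change, which is the kind of shift a DTU analysis is meant to detect.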

Volume 16, issue 5-6, pages 367-386.

Citations: 6

A statistical test for detecting parent-of-origin effects when parental information is missing.

IF 0.9 | Zone 4, Mathematics | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-09-26 DOI: 10.1515/sagmb-2017-0007

Chiara Sacco, Cinzia Viroli, Mario Falchi

Genomic imprinting is an epigenetic mechanism that leads to differential contributions of maternal and paternal alleles to offspring gene expression in a parent-of-origin manner. We propose a novel test for detecting parent-of-origin effects (POEs) in genome-wide genotype data from related individuals (twins) when the parental origin cannot be inferred. The proposed method exploits a finite mixture of linear mixed models: the key idea is that in the case of POEs the population can be clustered into two different groups in which the reference allele is inherited from a different parent. A further advantage of this approach is the possibility of estimating the parental effect when the parental information is missing. We will also show that the approach is flexible enough to be applicable to the general scenario of independent data. The performance of the proposed test is evaluated through an extensive simulation study. The method is finally applied to known imprinted genes of the MuTHER twin study data.

Citations: 0

Bayesian comparison of protein structures using partial Procrustes distance.

IF 0.9 | Zone 4, Mathematics | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-09-26 DOI: 10.1515/sagmb-2016-0014

Nasim Ejlali, Mohammad Reza Faghihi, Mehdi Sadeghi

An important topic in bioinformatics is protein structure alignment. Some statistical methods have been proposed for this problem, but most of them align two protein structures based on the global geometric information without considering the effect of neighbourhood in the structures. In this paper, we provide a Bayesian model to align protein structures, by considering the effect of both local and global geometric information of protein structures. Local geometric information is incorporated into the model through the partial Procrustes distance of small substructures. These substructures are composed of β-carbon atoms from the side chains. Parameters are estimated using a Markov chain Monte Carlo (MCMC) approach. We evaluate the performance of our model through some simulation studies. Furthermore, we apply our model to a real dataset and assess the accuracy and convergence rate. Results show that our model is much more efficient than previous approaches.
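As a rough illustration of a Procrustes-type distance between two small substructures, the sketch below works in two dimensions, where the optimal rotation has a closed form in complex arithmetic. It assumes the distance is the minimum root-sum-of-squares difference after optimal translation and rotation, without rescaling. The paper's exact definition may differ (some texts first normalise each configuration to unit size), and real protein coordinates are three-dimensional, where a singular value decomposition would be needed instead.

```python
import cmath
import math


def center(points):
    """Subtract the centroid so that translation no longer matters."""
    centroid = sum(points) / len(points)
    return [p - centroid for p in points]


def partial_procrustes_2d(a, b):
    """Minimum over rotations of the distance between two centred 2D point sets.

    Points are complex numbers x + yj, matched by index. For centred sets,
    min_theta sum |a_k - exp(i*theta) * b_k|^2 = |a|^2 + |b|^2 - 2 * |sum conj(a_k) * b_k|.
    """
    za, zb = center(a), center(b)
    cross = sum(x.conjugate() * y for x, y in zip(za, zb))
    squared = (sum(abs(x) ** 2 for x in za)
               + sum(abs(y) ** 2 for y in zb)
               - 2 * abs(cross))
    return math.sqrt(max(squared, 0.0))  # guard against tiny negative rounding


# Hypothetical landmark sets: a unit square, a rotated and shifted copy, a rectangle.
square = [0 + 0j, 1 + 0j, 1 + 1j, 0 + 1j]
moved = [p * cmath.exp(0.7j) + (3 - 2j) for p in square]
stretched = [0 + 0j, 2 + 0j, 2 + 1j, 0 + 1j]
```

`partial_procrustes_2d(square, moved)` is zero up to rounding because the two sets differ only by a rigid motion, whereas `partial_procrustes_2d(square, stretched)` is strictly positive.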

Citations: 2

Confidence intervals for heritability via Haseman-Elston regression.

IF 0.8 | Zone 4, Mathematics | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-09-26 DOI: 10.1515/sagmb-2016-0076

Tamar Sofer

Heritability is the proportion of phenotypic variance in a population that is attributable to individual genotypes. Heritability is considered an important measure in both evolutionary biology and in medicine, and is routinely estimated and reported in genetic epidemiology studies. In population-based genome-wide association studies (GWAS), mixed models are used to estimate variance components, from which a heritability estimate is obtained. The estimated heritability is the proportion of the model's total variance that is due to the genetic relatedness matrix (kinship measured from genotypes). Current practice is to use bootstrapping, which is slow, or normal asymptotic approximation to estimate the precision of the heritability estimate; however, this approximation fails to hold near the boundaries of the parameter space or when the sample size is small. In this paper we propose to estimate variance components via a Haseman-Elston regression, find the asymptotic distribution of the variance components and proportions of variance, and use them to construct confidence intervals (CIs). Our method is further developed to obtain unbiased variance components estimators and construct CIs by meta-analyzing information from multiple studies. We demonstrate our approach on data from the Hispanic Community Health Study/Study of Latinos (HCHS/SOL).
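The idea behind Haseman-Elston regression can be sketched for the simplest case of a single genetic variance component: for standardised phenotypes, the expected product of two individuals' values is proportional to their kinship, so regressing the products on the off-diagonal kinship entries gives a heritability estimate. The code below is an illustrative through-the-origin version on made-up numbers. The function name `he_heritability` and the data are assumptions, and the paper's estimator, which handles several variance components and confidence intervals, is more general.

```python
from statistics import mean, pstdev


def he_heritability(phenotypes, kinship):
    """Slope of z_i * z_j on K_ij over all pairs i < j (regression through the origin)."""
    mu, sd = mean(phenotypes), pstdev(phenotypes)
    z = [(value - mu) / sd for value in phenotypes]
    numerator = denominator = 0.0
    n = len(z)
    for i in range(n):
        for j in range(i + 1, n):
            numerator += kinship[i][j] * z[i] * z[j]
            denominator += kinship[i][j] ** 2
    return numerator / denominator


# Hypothetical data: four individuals and a symmetric kinship matrix.
phenotypes = [1.0, 1.0, -1.0, -1.0]
kinship = [
    [1.0, 0.5, 0.5, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.5, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]

h2_estimate = he_heritability(phenotypes, kinship)
```

Nothing in this estimator forces the result into the interval from 0 to 1, which is one reason interval estimates near the boundary of the parameter space need care.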

Volume 16, issue 4, pages 259-273. PDF (PubMed Central): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5857391/pdf/nihms922922.pdf

Citations: 0

FC1000: normalized gene expression changes of systematically perturbed human cells.

IF 0.9 | Zone 4, Mathematics | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-09-26 DOI: 10.1515/sagmb-2016-0072

Ingrid M Lönnstedt, Sven Nelander

The systematic study of transcriptional responses to genetic and chemical perturbations in human cells is still in its early stages. The largest available dataset to date is the newly released L1000 compendium. With its 1.3 million gene expression profiles of treated human cells it offers many opportunities for biomedical data mining, but also data normalization challenges of new dimensions. We developed a novel and practical approach to obtain accurate estimates of fold change response profiles from L1000, based on the RUV (Remove Unwanted Variation) statistical framework. Extending RUV to a big data setting, we propose an estimation procedure, in which an underlying RUV model is tuned by feedback through dataset specific statistical measures, reflecting p-value distributions and internal gene knockdown controls. Applying these metrics - termed evaluation endpoints - to disjoint data splits and integrating the results to select an optimal normalization, the procedure reduces bias and noise in the L1000 data, which in turn broadens the potential of this resource for pharmacological and functional genomic analyses. Our pipeline and normalization results are distributed as an R package (nelanderlab.org/FC1000.html).

Volume 16, issue 4, pages 217-242.

Citations: 4

Genetic association test based on principal component analysis.

IF 0.9 | Zone 4, Mathematics | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-07-26 DOI: 10.1515/sagmb-2016-0061

Zhongxue Chen, Shizhong Han, Kai Wang

Many gene- and pathway-based association tests have been proposed in the literature. Among them, the SKAT is widely used, especially for rare-variant association studies. In this paper, we investigate the connection between SKAT and a principal component analysis. This investigation leads to a procedure that encompasses SKAT as a special case. Through simulation studies and real data applications, we compare the proposed method with some existing tests.
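For orientation, the SKAT statistic with a weighted linear kernel can be written either as a weighted sum of squared per-variant scores or as a quadratic form in the kernel matrix K = G W G'. Expanding K in its eigenvectors turns that quadratic form into an eigenvalue-weighted sum of squared projections of the residuals, which is one standard way to see a link to principal components. Whether this is exactly the connection the paper develops is an assumption. The sketch below checks the first equivalence on made-up genotypes, with fixed residuals and weights rather than a fitted null model.

```python
def skat_q(genotypes, residuals, weights):
    """Q = sum_j w_j * (sum_i G_ij * r_i)^2, a weighted sum of squared variant scores."""
    q = 0.0
    for j, weight in enumerate(weights):
        score = sum(row[j] * r for row, r in zip(genotypes, residuals))
        q += weight * score ** 2
    return q


def kernel_quadratic_form(genotypes, residuals, weights):
    """The same statistic written as r' K r with K_ik = sum_j w_j * G_ij * G_kj."""
    total = 0.0
    for row_i, r_i in zip(genotypes, residuals):
        for row_k, r_k in zip(genotypes, residuals):
            k_ik = sum(w * a * b for w, a, b in zip(weights, row_i, row_k))
            total += r_i * k_ik * r_k
    return total


# Hypothetical data: 4 subjects, 2 variants coded as minor-allele counts.
genotypes = [[0, 1], [1, 0], [2, 1], [0, 0]]
outcome = [1, 0, 1, 0]
null_mean = 0.5  # intercept-only null model, assumed already fitted
residuals = [y - null_mean for y in outcome]
weights = [1.0, 4.0]  # illustrative variant weights

q_scores = skat_q(genotypes, residuals, weights)
q_kernel = kernel_quadratic_form(genotypes, residuals, weights)
```

Both functions give the same value on this toy input. In practice the weights are usually chosen as a function of minor allele frequency, and the null distribution of Q is a mixture of chi-square variables, neither of which is shown here.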

Citations: 10

Comparing the performance of linear and nonlinear principal components in the context of high-dimensional genomic data integration.

IF 0.9 | Zone 4, Mathematics | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-07-26 DOI: 10.1515/sagmb-2016-0066

Shofiqul Islam, Sonia Anand, Jemila Hamid, Lehana Thabane, Joseph Beyene

Linear principal component analysis (PCA) is a widely used approach to reduce the dimension of gene or miRNA expression data sets. This method relies on the linearity assumption, which often fails to capture the patterns and relationships inherent in the data. Thus, a nonlinear approach such as kernel PCA might be optimal. We develop a copula-based simulation algorithm that takes into account the degree of dependence and nonlinearity observed in these data sets. Using this algorithm, we conduct an extensive simulation to compare the performance of linear and kernel principal component analysis methods towards data integration and death classification. We also compare these methods using a real data set with gene and miRNA expression of lung cancer patients. The first few kernel principal components perform poorly compared with the linear principal components in this setting. Reducing dimensions using linear PCA and a logistic regression model for classification seems to be adequate for this purpose. Integrating information from multiple data sets using either of these two approaches leads to an improved classification accuracy for the outcome.
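Kernel PCA replaces the covariance matrix of linear PCA with a centred kernel (Gram) matrix, whose leading eigenvectors give the nonlinear components. The sketch below builds only that centred matrix, using a Gaussian (RBF) kernel on made-up expression values. The kernel choice, the `gamma` value and the function names are assumptions for illustration, and the eigendecomposition step is left out to keep the example dependency-free.

```python
import math


def rbf_gram(samples, gamma):
    """Gram matrix with entries exp(-gamma * squared Euclidean distance)."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y))) for y in samples]
        for x in samples
    ]


def center_gram(gram):
    """Double-centre a symmetric Gram matrix: K - 1K - K1 + 1K1."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]  # equal to column means by symmetry
    grand_mean = sum(row_means) / n
    return [
        [gram[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


# Hypothetical data: four samples measured on two features.
samples = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]]
centered = center_gram(rbf_gram(samples, gamma=1.0))
```

Every row and column of `centered` sums to zero up to rounding, which mirrors mean-centring the features before linear PCA. With a linear kernel in place of the RBF kernel, the same construction recovers ordinary PCA scores.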

Volume 16, issue 3, pages 199-216.

Citations: 0

Regularized estimation in sparse high-dimensional multivariate regression, with application to a DNA methylation study.

IF 0.8 | Zone 4, Mathematics | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-07-26 DOI: 10.1515/sagmb-2016-0073

Haixiang Zhang, Yinan Zheng, Grace Yoon, Zhou Zhang, Tao Gao, Brian Joyce, Wei Zhang, Joel Schwartz, Pantel Vokonas, Elena Colicino, Andrea Baccarelli, Lifang Hou, Lei Liu

In this article, we consider variable selection for correlated high dimensional DNA methylation markers as multivariate outcomes. A novel weighted square-root LASSO procedure is proposed to estimate the regression coefficient matrix. A key feature of this method is tuning-insensitivity, which greatly simplifies the computation by obviating cross validation for penalty parameter selection. A precision matrix obtained via the constrained ℓ1 minimization method is used to account for the within-subject correlation among multivariate outcomes. Oracle inequalities of the regularized estimators are derived. The performance of our proposed method is illustrated via extensive simulation studies. We apply our method to study the relation between smoking and high dimensional DNA methylation markers in the Normative Aging Study (NAS).
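The tuning-insensitivity mentioned above is usually explained through the form of the square-root LASSO objective, which penalises the root of the mean squared residual rather than the mean squared residual itself. The sketch below evaluates the standard single-outcome, unweighted objective on made-up numbers and checks that it is homogeneous of degree one: rescaling the response and the coefficients by a constant rescales the loss and the penalty equally, so a suitable penalty level does not have to track the noise scale. The paper's weighted multivariate procedure is more elaborate, and the function name and the normalisation of `lam` are assumptions.

```python
import math


def sqrt_lasso_objective(design, response, beta, lam):
    """sqrt(RSS / n) + lam * ||beta||_1 for a single outcome."""
    n = len(response)
    residuals = [
        y - sum(x * b for x, b in zip(row, beta))
        for row, y in zip(design, response)
    ]
    rss = sum(r * r for r in residuals)
    return math.sqrt(rss / n) + lam * sum(abs(b) for b in beta)


# Hypothetical data: three observations, two predictors.
design = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
response = [1.0, 2.0, 2.5]
beta = [0.8, 1.5]
lam = 0.1

base_value = sqrt_lasso_objective(design, response, beta, lam)

# Multiply the response (and hence the noise) and the coefficients by 3.
scale = 3.0
scaled_value = sqrt_lasso_objective(
    design, [scale * y for y in response], [scale * b for b in beta], lam
)
```

Here `scaled_value` equals `scale * base_value` up to rounding. With the ordinary LASSO the squared loss would scale by `scale ** 2` while the penalty scales by `scale`, which is why its penalty level has to be retuned when the noise level changes.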

Citations: 0

Mixture model-based association analysis with case-control data in genome wide association studies.

IF 0.9 | Zone 4, Mathematics | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-07-26 DOI: 10.1515/sagmb-2016-0022

Fadhaa Ali, Jian Zhang

Multilocus haplotype analysis of candidate variants with genome wide association studies (GWAS) data may provide evidence of association with disease, even when the individual loci themselves do not. Unfortunately, when a large number of candidate variants are investigated, identifying risk haplotypes can be very difficult. To meet the challenge, a number of approaches have been put forward in recent years. However, most of them are not directly linked to the disease-penetrances of haplotypes and thus may not be efficient. To fill this gap, we propose a mixture model-based approach for detecting risk haplotypes. Under the mixture model, haplotypes are clustered directly according to their estimated disease penetrances. A theoretical justification of the above model is provided. Furthermore, we introduce a hypothesis test for haplotype inheritance patterns which underpin this model. The performance of the proposed approach is evaluated by simulations and real data analysis. The results show that the proposed approach outperforms an existing multiple testing method.

Volume 16, issue 3, pages 173-187.

Citations: 2
