Colleen Nooney, Stuart Barber, Arief Gusnanto, Walter R Gilks
We introduce a new method to test efficiently for cospeciation in tritrophic systems. Our method utilises an analogy with electrical circuit theory to reduce higher order systems into bitrophic data sets that retain the information of the original system. We use a sophisticated permutation scheme that weights interactions between two trophic layers based on their connection to the third layer in the system. Our method has several advantages compared to the method of Mramba et al. [Mramba, L. K., S. Barber, K. Hommola, L. A. Dyer, J. S. Wilson, M. L. Forister and W. R. Gilks (2013): "Permutation tests for analyzing cospeciation in multiple phylogenies: applications in tri-trophic ecology," Stat. Appl. Genet. Mol. Biol., 12, 679-701.]. We do not require triangular interactions to connect the three phylogenetic trees and an easily interpreted p-value is obtained in one step. Another advantage of our method is the scope for generalisation to higher order systems and phylogenetic networks. The performance of our method is compared to the methods of Hommola et al. [Hommola, K., J. E. Smith, Y. Qiu and W. R. Gilks (2009): "A permutation test of host-parasite cospeciation," Mol. Biol. Evol., 26, 1457-1468.] and Mramba et al. [Mramba, L. K., S. Barber, K. Hommola, L. A. Dyer, J. S. Wilson, M. L. Forister and W. R. Gilks (2013): "Permutation tests for analyzing cospeciation in multiple phylogenies: applications in tri-trophic ecology," Stat. Appl. Genet. Mol. Biol., 12, 679-701.] at the bitrophic and tritrophic level, respectively. This was achieved by evaluating type I error and statistical power. The results show that our method produces unbiased p-values and has comparable power overall at both trophic levels. Our method was successfully applied to a dataset of leaf-mining moths, parasitoid wasps and host plants [Lopez-Vaamonde, C., H. Godfray, S. West, C. Hansson and J. Cook (2005): "The evolution of host use and unusual reproductive strategies in Achrysocharoides parasitoid wasps," J. Evol. Biol., 18, 1029-1041.], at both the bitrophic and tritrophic levels.
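To make the idea of a permutation test for cospeciation concrete, the following is a minimal bitrophic sketch in Python. It is not the tritrophic, circuit-theory method of the abstract above: it assumes a simple statistic (the Pearson correlation between host and parasite distances over all pairs of interaction links) and an unweighted permutation of the tip labels of both trees. The function names, the distance-matrix inputs and the defaults `n_perm` and `seed` are illustrative assumptions.

```python
import math
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def link_correlation(host_dist, para_dist, links):
    """Correlate host and parasite distances over all pairs of links.

    links is a list of (host_index, parasite_index) interactions.
    """
    host_d, para_d = [], []
    for (h1, p1), (h2, p2) in combinations(links, 2):
        host_d.append(host_dist[h1][h2])
        para_d.append(para_dist[p1][p2])
    return pearson(host_d, para_d)


def permutation_test(host_dist, para_dist, links, n_perm=999, seed=0):
    """Return (observed correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    observed = link_correlation(host_dist, para_dist, links)
    n_hosts, n_paras = len(host_dist), len(para_dist)
    exceed = 0
    for _ in range(n_perm):
        # Assumption: relabel the tips of both trees uniformly at random.
        h_perm = list(range(n_hosts))
        p_perm = list(range(n_paras))
        rng.shuffle(h_perm)
        rng.shuffle(p_perm)
        permuted = [(h_perm[h], p_perm[p]) for h, p in links]
        stat = link_correlation(host_dist, para_dist, permuted)
        if stat >= observed - 1e-12:  # small tolerance for rounding
            exceed += 1
    # Count the observed labelling as one of the permutations.
    return observed, (exceed + 1) / (n_perm + 1)
```

Adding one to both the numerator and the denominator keeps the estimated p-value strictly positive, which is a common convention for Monte Carlo permutation tests.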
{"title":"A statistical method for analysing cospeciation in tritrophic ecology using electrical circuit theory.","authors":"Colleen Nooney, Stuart Barber, Arief Gusnanto, Walter R Gilks","doi":"10.1515/sagmb-2016-0049","DOIUrl":"https://doi.org/10.1515/sagmb-2016-0049","url":null,"PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"16 5-6","pages":"313-331"},"PeriodicalIF":0.9,"publicationDate":"2017-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2016-0049","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35577574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Mathematics","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Next generation sequencing allows the identification of genes consisting of differentially expressed transcripts, a term which usually refers to changes in the overall expression level. A specific type of differential expression is differential transcript usage (DTU), which targets changes in the relative within-gene expression of a transcript. The contribution of this paper is to: (a) extend cjBitSeq, a previously introduced Bayesian model originally designed for identifying changes in overall expression levels, to the DTU context and (b) propose a Bayesian version of DRIMSeq, a frequentist model for inferring DTU. cjBitSeq is a read-based model and performs fully Bayesian inference by MCMC sampling on the space of latent states of each transcript per gene. BayesDRIMSeq is a count-based model and estimates the Bayes factor of a DTU model against a null model using Laplace's approximation. The proposed models are benchmarked against the existing ones using a recent independent simulation study as well as a real RNA-seq dataset. Our results suggest that the Bayesian methods exhibit similar performance to DRIMSeq in terms of precision/recall but offer better calibration of the false discovery rate.
{"title":"Bayesian estimation of differential transcript usage from RNA-seq data.","authors":"Panagiotis Papastamoulis, Magnus Rattray","doi":"10.1515/sagmb-2017-0005","DOIUrl":"https://doi.org/10.1515/sagmb-2017-0005","url":null,"PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"16 5-6","pages":"367-386"},"PeriodicalIF":0.9,"publicationDate":"2017-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2017-0005","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35561338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Mathematics","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Genomic imprinting is an epigenetic mechanism that leads to differential contributions of maternal and paternal alleles to offspring gene expression in a parent-of-origin manner. We propose a novel test for detecting parent-of-origin effects (POEs) in genome-wide genotype data from related individuals (twins) when the parental origin cannot be inferred. The proposed method exploits a finite mixture of linear mixed models: the key idea is that in the case of POEs the population can be clustered into two different groups in which the reference allele is inherited from a different parent. A further advantage of this approach is the possibility of obtaining an estimate of the parental effect when the parental information is missing. We also show that the approach is flexible enough to be applicable to the general scenario of independent data. The performance of the proposed test is evaluated through an extensive simulation study. The method is finally applied to known imprinted genes of the MuTHER twin study data.
{"title":"A statistical test for detecting parent-of-origin effects when parental information is missing.","authors":"Chiara Sacco, Cinzia Viroli, Mario Falchi","doi":"10.1515/sagmb-2017-0007","DOIUrl":"https://doi.org/10.1515/sagmb-2017-0007","url":null,"PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"16 4","pages":"275-289"},"PeriodicalIF":0.9,"publicationDate":"2017-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2017-0007","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35318751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Mathematics","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heritability is the proportion of phenotypic variance in a population that is attributable to individual genotypes. Heritability is considered an important measure in both evolutionary biology and in medicine, and is routinely estimated and reported in genetic epidemiology studies. In population-based genome-wide association studies (GWAS), mixed models are used to estimate variance components, from which a heritability estimate is obtained. The estimated heritability is the proportion of the model's total variance that is due to the genetic relatedness matrix (kinship measured from genotypes). Current practice is to use bootstrapping, which is slow, or normal asymptotic approximation to estimate the precision of the heritability estimate; however, this approximation fails to hold near the boundaries of the parameter space or when the sample size is small. In this paper we propose to estimate variance components via a Haseman-Elston regression, find the asymptotic distribution of the variance components and proportions of variance, and use them to construct confidence intervals (CIs). Our method is further developed to obtain unbiased variance components estimators and construct CIs by meta-analyzing information from multiple studies. We demonstrate our approach on data from the Hispanic Community Health Study/Study of Latinos (HCHS/SOL).
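The core of a Haseman-Elston regression can be sketched in a few lines of Python. The sketch assumes the simplest single-component setting: the phenotype is standardised, and the products of standardised phenotypes for each pair of individuals are regressed, through the origin, on the off-diagonal kinship entries, so that the slope serves as the heritability estimate. The function name `haseman_elston` and its inputs are illustrative assumptions; the confidence-interval construction and meta-analysis described in the abstract are not reproduced here.

```python
import math


def haseman_elston(phenotype, kinship):
    """Slope of the regression of z_i * z_j on K_ij over all pairs i < j.

    phenotype: list of n trait values.
    kinship:   n x n genetic relatedness matrix (only i < j entries are used).
    """
    n = len(phenotype)
    mean = sum(phenotype) / n
    var = sum((v - mean) ** 2 for v in phenotype) / (n - 1)
    if var == 0:
        raise ValueError("phenotype has zero variance")
    sd = math.sqrt(var)
    z = [(v - mean) / sd for v in phenotype]

    num = 0.0  # sum of K_ij * z_i * z_j
    den = 0.0  # sum of K_ij ** 2
    for i in range(n):
        for j in range(i + 1, n):
            num += kinship[i][j] * z[i] * z[j]
            den += kinship[i][j] ** 2
    if den == 0:
        raise ValueError("all off-diagonal kinship entries are zero")
    return num / den
```

Because this is a moment estimator, nothing constrains the result to lie in [0, 1]. That is one reason interval estimates near the boundary of the parameter space need care.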
{"title":"Confidence intervals for heritability via Haseman-Elston regression.","authors":"Tamar Sofer","doi":"10.1515/sagmb-2016-0076","DOIUrl":"10.1515/sagmb-2016-0076","url":null,"PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"16 4","pages":"259-273"},"PeriodicalIF":0.9,"publicationDate":"2017-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5857391/pdf/nihms922922.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35318749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Mathematics","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nasim Ejlali, Mohammad Reza Faghihi, Mehdi Sadeghi
An important topic in bioinformatics is protein structure alignment. Some statistical methods have been proposed for this problem, but most of them align two protein structures based on the global geometric information without considering the effect of neighbourhood in the structures. In this paper, we provide a Bayesian model to align protein structures by considering the effect of both local and global geometric information of protein structures. Local geometric information is incorporated into the model through the partial Procrustes distance of small substructures. These substructures are composed of β-carbon atoms from the side chains. Parameters are estimated using a Markov chain Monte Carlo (MCMC) approach. We evaluate the performance of our model through simulation studies. Furthermore, we apply our model to a real dataset and assess the accuracy and convergence rate. Results show that our model is much more efficient than previous approaches.
{"title":"Bayesian comparison of protein structures using partial Procrustes distance.","authors":"Nasim Ejlali, Mohammad Reza Faghihi, Mehdi Sadeghi","doi":"10.1515/sagmb-2016-0014","DOIUrl":"https://doi.org/10.1515/sagmb-2016-0014","url":null,"PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"16 4","pages":"243-257"},"PeriodicalIF":0.9,"publicationDate":"2017-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2016-0014","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35318750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Mathematics","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The systematic study of transcriptional responses to genetic and chemical perturbations in human cells is still in its early stages. The largest available dataset to date is the newly released L1000 compendium. With its 1.3 million gene expression profiles of treated human cells it offers many opportunities for biomedical data mining, but also data normalization challenges of new dimensions. We developed a novel and practical approach to obtain accurate estimates of fold change response profiles from L1000, based on the RUV (Remove Unwanted Variation) statistical framework. Extending RUV to a big data setting, we propose an estimation procedure, in which an underlying RUV model is tuned by feedback through dataset specific statistical measures, reflecting p-value distributions and internal gene knockdown controls. Applying these metrics - termed evaluation endpoints - to disjoint data splits and integrating the results to select an optimal normalization, the procedure reduces bias and noise in the L1000 data, which in turn broadens the potential of this resource for pharmacological and functional genomic analyses. Our pipeline and normalization results are distributed as an R package (nelanderlab.org/FC1000.html).
{"title":"FC1000: normalized gene expression changes of systematically perturbed human cells.","authors":"Ingrid M Lönnstedt, Sven Nelander","doi":"10.1515/sagmb-2016-0072","DOIUrl":"https://doi.org/10.1515/sagmb-2016-0072","url":null,"PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"16 4","pages":"217-242"},"PeriodicalIF":0.9,"publicationDate":"2017-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2016-0072","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35318753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Mathematics","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shofiqul Islam, Sonia Anand, Jemila Hamid, Lehana Thabane, Joseph Beyene
Linear principal component analysis (PCA) is a widely used approach to reduce the dimension of gene or miRNA expression data sets. This method relies on the linearity assumption, which often fails to capture the patterns and relationships inherent in the data. Thus, a nonlinear approach such as kernel PCA might be optimal. We develop a copula-based simulation algorithm that takes into account the degree of dependence and nonlinearity observed in these data sets. Using this algorithm, we conduct an extensive simulation to compare the performance of linear and kernel principal component analysis methods for data integration and death classification. We also compare these methods using a real data set with gene and miRNA expression of lung cancer patients. The first few kernel principal components show poor performance compared to the linear principal components in this setting. Reducing dimensions using linear PCA and a logistic regression model for classification seems to be adequate for this purpose. Integrating information from multiple data sets using either of these two approaches leads to improved classification accuracy for the outcome.
{"title":"Comparing the performance of linear and nonlinear principal components in the context of high-dimensional genomic data integration.","authors":"Shofiqul Islam, Sonia Anand, Jemila Hamid, Lehana Thabane, Joseph Beyene","doi":"10.1515/sagmb-2016-0066","DOIUrl":"https://doi.org/10.1515/sagmb-2016-0066","url":null,"PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"16 3","pages":"199-216"},"PeriodicalIF":0.9,"publicationDate":"2017-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2016-0066","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35184782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Mathematics","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many gene- and pathway-based association tests have been proposed in the literature. Among them, the sequence kernel association test (SKAT) is widely used, especially for rare-variant association studies. In this paper, we investigate the connection between SKAT and principal component analysis. This investigation leads to a procedure that encompasses SKAT as a special case. Through simulation studies and real data applications, we compare the proposed method with some existing tests.
{"title":"Genetic association test based on principal component analysis.","authors":"Zhongxue Chen, Shizhong Han, Kai Wang","doi":"10.1515/sagmb-2016-0061","DOIUrl":"https://doi.org/10.1515/sagmb-2016-0061","url":null,"PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"16 3","pages":"189-198"},"PeriodicalIF":0.9,"publicationDate":"2017-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2016-0061","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35138475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Mathematics","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haixiang Zhang, Yinan Zheng, Grace Yoon, Zhou Zhang, Tao Gao, Brian Joyce, Wei Zhang, Joel Schwartz, Pantel Vokonas, Elena Colicino, Andrea Baccarelli, Lifang Hou, Lei Liu
In this article, we consider variable selection for correlated high dimensional DNA methylation markers as multivariate outcomes. A novel weighted square-root LASSO procedure is proposed to estimate the regression coefficient matrix. A key feature of this method is tuning-insensitivity, which greatly simplifies the computation by obviating cross validation for penalty parameter selection. A precision matrix obtained via the constrained ℓ1 minimization method is used to account for the within-subject correlation among multivariate outcomes. Oracle inequalities of the regularized estimators are derived. The performance of our proposed method is illustrated via extensive simulation studies. We apply our method to study the relation between smoking and high dimensional DNA methylation markers in the Normative Aging Study (NAS).
{"title":"Regularized estimation in sparse high-dimensional multivariate regression, with application to a DNA methylation study.","authors":"Haixiang Zhang, Yinan Zheng, Grace Yoon, Zhou Zhang, Tao Gao, Brian Joyce, Wei Zhang, Joel Schwartz, Pantel Vokonas, Elena Colicino, Andrea Baccarelli, Lifang Hou, Lei Liu","doi":"10.1515/sagmb-2016-0073","DOIUrl":"https://doi.org/10.1515/sagmb-2016-0073","url":null,"PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"16 3","pages":"159-171"},"PeriodicalIF":0.9,"publicationDate":"2017-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2016-0073","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35190151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Mathematics","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multilocus haplotype analysis of candidate variants with genome-wide association studies (GWAS) data may provide evidence of association with disease, even when the individual loci themselves do not. Unfortunately, when a large number of candidate variants are investigated, identifying risk haplotypes can be very difficult. To meet this challenge, a number of approaches have been put forward in recent years. However, most of them are not directly linked to the disease penetrances of haplotypes and thus may not be efficient. To fill this gap, we propose a mixture model-based approach for detecting risk haplotypes. Under the mixture model, haplotypes are clustered directly according to their estimated disease penetrances. A theoretical justification of the above model is provided. Furthermore, we introduce a hypothesis test for the haplotype inheritance patterns that underpin this model. The performance of the proposed approach is evaluated by simulations and real data analysis. The results show that the proposed approach outperforms an existing multiple testing method.
{"title":"Mixture model-based association analysis with case-control data in genome wide association studies.","authors":"Fadhaa Ali, Jian Zhang","doi":"10.1515/sagmb-2016-0022","DOIUrl":"https://doi.org/10.1515/sagmb-2016-0022","url":null,"PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"16 3","pages":"173-187"},"PeriodicalIF":0.9,"publicationDate":"2017-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2016-0022","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35182457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Mathematics","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}