Statistical Applications in Genetics and Molecular Biology: latest articles, page 7

Data-adaptive multi-locus association testing in subjects with arbitrary genealogical relationships.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-04-08 DOI: 10.1515/sagmb-2018-0030

Gail Gong, Wei Wang, Chih-Lin Hsieh, David J Van Den Berg, Christopher Haiman, Ingrid Oakley-Girvan, Alice S Whittemore

Genome-wide sequencing enables evaluation of associations between traits and combinations of variants in genes and pathways. But such evaluation requires multi-locus association tests with good power, regardless of the variant and trait characteristics. And since analyzing families may yield more power than analyzing unrelated individuals, we need multi-locus tests applicable to both related and unrelated individuals. Here we describe such tests, and we introduce SKAT-X, a new test statistic that uses genome-wide data obtained from related or unrelated subjects to optimize power for the specific data at hand. Simulations show that: a) SKAT-X performs well regardless of variant and trait characteristics; and b) for binary traits, analyzing affected relatives brings more power than analyzing unrelated individuals, consistent with previous findings for single-locus tests. We illustrate the methods by application to rare unclassified missense variants in the tumor suppressor gene BRCA2, using combined data from prostate cancer families and unrelated prostate cancer cases and controls in the Multi-ethnic Cohort (MEC). The methods are implemented in open-source code, available for public use as the R package GATARS (Genetic Association Tests for Arbitrarily Related Subjects) at https://gailg.github.io/gatars/.
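
As background to the remark about power "regardless of the variant and trait characteristics", the following minimal sketch contrasts two textbook multi-locus statistics: a burden statistic, which squares the weighted sum of per-variant scores, and a SKAT-type statistic, which sums the squared weighted scores. It is not SKAT-X and not the GATARS interface. It assumes unrelated subjects, an intercept-only null model, and analyst-chosen weights, and the function name `burden_and_skat` is hypothetical.

```python
import numpy as np

def burden_and_skat(genotypes, trait, weights):
    """Return (burden, skat) statistics for an n x p genotype matrix.

    genotypes: minor-allele counts (0/1/2), one row per subject
    trait:     length-n trait vector
    weights:   length-p variant weights (hypothetical, chosen by the analyst)
    """
    G = np.asarray(genotypes, dtype=float)
    y = np.asarray(trait, dtype=float)
    w = np.asarray(weights, dtype=float)
    residuals = y - y.mean()            # intercept-only null model (assumption)
    scores = G.T @ residuals            # per-variant score S_j
    burden = float(np.sum(w * scores) ** 2)    # (sum_j w_j S_j)^2
    skat = float(np.sum((w * scores) ** 2))    # sum_j (w_j S_j)^2
    return burden, skat

# Two rare variants with opposite directions of effect on the trait.
G = np.array([[1, 0], [0, 1], [0, 0], [0, 0]])
y = np.array([1.0, 0.0, 0.0, 1.0])
print(burden_and_skat(G, y, weights=[1.0, 1.0]))
```

In this toy data set the two per-variant scores have opposite signs, so they cancel in the burden statistic but not in the quadratic one. This is one reason a test that adapts to the data at hand can be attractive.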


Citations: 1

A multivariate linear model for investigating the association between gene-module co-expression and a continuous covariate.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-03-15 DOI: 10.1515/sagmb-2018-0008

Trishanta Padayachee, Tatsiana Khamiakova, Ziv Shkedy, Perttu Salo, Markus Perola, Tomasz Burzykowski

A way to enhance our understanding of the development and progression of complex diseases is to investigate the influence of cellular environments on gene co-expression (i.e. gene-pair correlations). Often, changes in gene co-expression are investigated across two or more biological conditions defined by categorizing a continuous covariate. However, the selection of arbitrary cut-off points may have an influence on the results of an analysis. To address this issue, we use a general linear model (GLM) for correlated data to study the relationship between gene-module co-expression and a covariate like metabolite concentration. The GLM specifies the gene-pair correlations as a function of the continuous covariate. The use of the GLM allows for investigating different (linear and non-linear) patterns of co-expression. Furthermore, the modeling approach offers a formal framework for testing hypotheses about possible patterns of co-expression. In our paper, a simulation study is used to assess the performance of the GLM. The performance is compared with that of a previously proposed GLM that utilizes categorized covariates. The versatility of the model is illustrated by using a real-life example. We discuss the theoretical issues related to the construction of the test statistics and the computational challenges related to fitting of the proposed model.


Citations: 0

netprioR: a probabilistic model for integrative hit prioritisation of genetic screens.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-03-06 DOI: 10.1515/sagmb-2018-0033

Fabian Schmich, Jack Kuipers, Gunter Merdes, Niko Beerenwinkel

In the post-genomic era of big data in biology, computational approaches to integrate multiple heterogeneous data sets are becoming increasingly important. Despite the availability of large amounts of omics data, the prioritisation of genes relevant for a specific functional pathway based on genetic screening experiments remains a challenging task. Here, we introduce netprioR, a probabilistic generative model for semi-supervised integrative prioritisation of hit genes. The model integrates multiple network data sets representing gene-gene similarities and prior knowledge about gene functions from the literature with gene-based covariates, such as phenotypes measured in genetic perturbation screens, for example, by RNA interference or CRISPR/Cas9. We evaluate netprioR on simulated data and show that the model outperforms current state-of-the-art methods in many scenarios and is on par otherwise. In an application to real biological data, we integrate 22 network data sets, 1784 prior knowledge class labels and 3840 RNA interference phenotypes in order to prioritise novel regulators of Notch signalling in Drosophila melanogaster. The biological relevance of our predictions is evaluated using in silico and in vivo experiments. An efficient implementation of netprioR is available as an R package at http://bioconductor.org/packages/netprioR.


Citations: 1

Discrete Wavelet Packet Transform Based Discriminant Analysis for Whole Genome Sequences.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-02-15 DOI: 10.1515/sagmb-2018-0045

Hsin-Hsiung Huang, Senthil Balaji Girimurugan

In recent years, alignment-free methods have been widely applied in comparing genome sequences, as these methods compute efficiently and provide desirable phylogenetic analysis results. These methods have been successfully combined with hierarchical clustering methods for finding phylogenetic trees. However, it may not be suitable to combine these alignment-free methods directly with existing statistical classification methods, because an appropriate statistical classification theory that integrates with the alignment-free representation methods is still lacking. In this article, we propose a discriminant analysis method which uses the discrete wavelet packet transform to classify whole genome sequences. The proposed alignment-free representation statistics of features follow a joint normal distribution asymptotically. The data analysis results indicate that the proposed method provides satisfactory classification results in real time.


Citations: 1

LCox: a tool for selecting genes related to survival outcomes using longitudinal gene expression data.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-02-13 DOI: 10.1515/sagmb-2017-0060

Jiehuan Sun, Jose D Herazo-Maya, Jane-Ling Wang, Naftali Kaminski, Hongyu Zhao

Longitudinal genomics data and survival outcomes are common in biomedical studies, where the genomics data are often high-dimensional. It is of great interest to select informative longitudinal biomarkers (e.g. genes) related to the survival outcome. In this paper, we develop a computationally efficient tool, LCox, for selecting informative biomarkers related to the survival outcome using longitudinal genomics data. LCox has high power to detect different forms of dependence between the longitudinal biomarkers and the survival outcome. Through extensive simulation studies, we show that LCox performs better than existing methods. In addition, by applying LCox to a dataset of patients with idiopathic pulmonary fibrosis, we are able to identify biologically meaningful genes while all other methods fail to make any discovery. An R package to perform LCox is freely available at https://CRAN.R-project.org/package=LCox.


Citations: 0

Meta-analytic framework for modeling genetic coexpression dynamics.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-02-09 DOI: 10.1515/sagmb-2017-0052

Tyler G Kinzy, Timothy K Starr, George C Tseng, Yen-Yi Ho

Methods for exploring genetic interactions have been developed in an attempt to move beyond single gene analyses. Because biological molecules frequently participate in different processes under various cellular conditions, investigating the changes in gene coexpression patterns under various biological conditions could reveal important regulatory mechanisms. One of the methods for capturing gene coexpression dynamics, named liquid association (LA), quantifies the relationship where the coexpression between two genes is modulated by a third "coordinator" gene. This LA measure offers a natural framework for studying gene coexpression changes and has been applied increasingly to study regulatory networks among genes. With a wealth of publicly available gene expression data, there is a need to develop a meta-analytic framework for LA analysis. In this paper, we incorporated mixed effects when modeling correlation to account for between-studies heterogeneity. For statistical inference about LA, we developed a Markov chain Monte Carlo (MCMC) estimation procedure through a Bayesian hierarchical framework. We evaluated the proposed methods in a set of simulations and illustrated their use in two collections of experimental data sets. The first data set combined 10 pancreatic ductal adenocarcinoma gene expression studies to determine the role of possible coordinator gene USP9X in the Hippo pathway. The second experimental data set consisted of 907 gene expression microarray Escherichia coli experiments from multiple studies publicly available through the Many Microbe Microarray Database website (http://m3d.bu.edu/) and examined genes that coexpress with serA in the presence of coordinator gene Lrp.
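
To make the liquid association (LA) idea concrete, here is a minimal single-study sketch. It uses the classical moment estimator: when the coordinator gene is transformed to normal scores and the other two genes are standardized, LA can be estimated by the average three-way product. This is not the mixed-effects or MCMC procedure described in the abstract. The function names and the simulated data are assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist

def normal_scores(values):
    """Map values to standard-normal quantiles of their ranks."""
    values = np.asarray(values, dtype=float)
    ranks = np.argsort(np.argsort(values)) + 1      # ranks 1..n
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(r / (len(values) + 1)) for r in ranks])

def standardize(values):
    """Center to mean 0 and scale to standard deviation 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def liquid_association(x, y, z):
    """Moment estimate of LA(x, y | z): mean of the three-way product."""
    return float(np.mean(standardize(x) * standardize(y) * normal_scores(z)))

# Simulated example: the x-y relationship strengthens as z increases.
rng = np.random.default_rng(0)
z = rng.normal(size=5000)
x = rng.normal(size=5000)
y = z * x + 0.5 * rng.normal(size=5000)
print(liquid_association(x, y, z))
```

A clearly positive value indicates that coexpression of the pair increases with the coordinator, and a value near zero indicates no modulation. A meta-analytic framework such as the one described above would need to combine evidence of this kind across studies while allowing for heterogeneity.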


Citations: 0

Sliced inverse regression for integrative multi-omics data analysis.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-01-26 DOI: 10.1515/sagmb-2018-0028

Yashita Jain, Shanshan Ding, Jing Qiu

Advances in next-generation sequencing, transcriptomics, proteomics and other high-throughput technologies have enabled simultaneous measurement of multiple types of genomic data for cancer samples. Together, these data may reveal new biological insights compared with analyzing a single type of genomic data. This study proposes a novel use of a supervised dimension reduction method, sliced inverse regression, in multi-omics data analysis to improve prediction over single-data-type analysis. The study further proposes an integrative sliced inverse regression method (integrative SIR) for simultaneous analysis of multiple omics data types of cancer samples, including miRNA, mRNA and proteomics, to achieve integrative dimension reduction and to further improve prediction performance. Numerical results show that integrative analysis of multi-omics data is beneficial compared with single data source analysis and, more importantly, that supervised dimension reduction methods have advantages over unsupervised dimension reduction methods in integrative data analysis in terms of classification and prediction.
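
For readers unfamiliar with sliced inverse regression, the basic single-data-type algorithm can be sketched as follows: whiten the predictors, slice the samples by the sorted response, average the whitened predictors within each slice, and take the leading eigenvectors of the weighted covariance of the slice means. This is a minimal illustration of classical SIR, not the integrative SIR method proposed in the study. The function name, slice count and simulated data are assumptions.

```python
import numpy as np

def sliced_inverse_regression(X, y, n_slices=10, n_directions=1):
    """Estimate SIR directions; columns of the result have unit length."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Whiten the predictors: Z = (X - mean) @ Sigma^{-1/2}
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - X.mean(axis=0)) @ inv_sqrt
    # Slice on the sorted response and accumulate weighted slice means
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        slice_mean = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(slice_mean, slice_mean)
    # Leading eigenvectors of M, mapped back to the original predictor scale
    _, vectors = np.linalg.eigh(M)              # eigenvalues in ascending order
    top = vectors[:, ::-1][:, :n_directions]
    B = inv_sqrt @ top
    return B / np.linalg.norm(B, axis=0)

# Simulated example: the response depends on X only through one direction.
rng = np.random.default_rng(0)
beta = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
X = rng.normal(size=(2000, 5))
y = (X @ beta) ** 3 + 0.1 * rng.normal(size=2000)
b = sliced_inverse_regression(X, y)[:, 0]
print(abs(b @ beta))   # close to 1 when the direction is recovered
```

The whitening step requires an invertible covariance matrix, so with high-dimensional omics data (more features than samples) some form of regularization or prior dimension reduction would be needed. That is outside the scope of this sketch.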


Citations: 1

A powerful test for ordinal trait genetic association analysis.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-01-26 DOI: 10.1515/sagmb-2017-0066

Yuan Xue, Jinjuan Wang, Juan Ding, Sanguo Zhang, Qizhai Li

Response-selective sampling designs are commonly adopted in genetic epidemiologic studies because, compared with prospective designs, they can substantially reduce time cost and increase the power to identify deleterious genetic variants that predispose to complex human diseases. The proportional odds model (POM) can be used to fit data obtained under such a design. Unlike with the logistic regression model, however, the genetic effect estimated from the POM by treating the data as if they had been enrolled prospectively is inconsistent, so the power of the resulting Wald test is unsatisfactory. The modified POM is suitable for fitting this type of data, but the corresponding Wald test is not optimal when the genetic effect is small. Here, we propose a new association test to handle this issue. Simulation studies show that the proposed test controls the type I error rate correctly and is more powerful than two existing methods. Finally, we apply the three tests to anti-cyclic citrullinated protein antibody data from Genetic Workshop 16.
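
As a reminder of what the proportional odds model specifies, the sketch below computes ordinal category probabilities for a given genotype under one common parameterization, P(Y <= k | g) = expit(alpha_k - beta * g). The sign convention, the cutpoint values and the function name are assumptions of this illustration. It shows only the prospective model, not the modified POM or the proposed test.

```python
import math

def pom_probabilities(cutpoints, beta, genotype):
    """Category probabilities under P(Y <= k | g) = expit(alpha_k - beta * g).

    cutpoints: increasing thresholds alpha_1 < ... < alpha_{K-1}
    beta:      genetic effect (log cumulative odds ratio per allele)
    genotype:  allele count g (0, 1 or 2)
    """
    cumulative = [1.0 / (1.0 + math.exp(-(a - beta * genotype))) for a in cutpoints]
    cumulative.append(1.0)                       # P(Y <= K) = 1
    probabilities = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probabilities.append(cumulative[k] - cumulative[k - 1])
    return probabilities

# Three ordered categories; a positive beta shifts mass to higher categories.
for g in (0, 1, 2):
    print(g, pom_probabilities([-1.0, 1.0], beta=0.5, genotype=g))
```

Under this parameterization a single coefficient beta describes the genetic effect across all cutpoints, which is what makes a one-degree-of-freedom association test possible.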


Citations: 2

Sample size calculations for the differential expression analysis of RNA-seq data using a negative binomial regression model.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-01-22 DOI: 10.1515/sagmb-2018-0021

Xiaohong Li, Dongfeng Wu, Nigel G F Cooper, Shesh N Rai

High-throughput RNA sequencing (RNA-seq) technology is increasingly used in disease-related biomarker studies. A negative binomial distribution has become a popular choice for modeling read counts of genes in RNA-seq data because read counts are over-dispersed. In this study, we propose two explicit sample size calculation methods for RNA-seq data using a negative binomial regression model. To derive these new sample size formulas, the common dispersion parameter and the size factor, as an offset via a natural logarithm link function, are incorporated. A two-sided Wald test statistic derived from the coefficient parameter is used for testing a single gene at a nominal significance level of 0.05 and multiple genes at a false discovery rate of 0.05. The variance for the Wald test is computed from the variance-covariance matrix, with parameters estimated by maximum likelihood under the unrestricted and constrained scenarios. We evaluate the performance of our new formulas and compare them side by side, via simulation studies, with three existing methods based on a Wald test, a likelihood ratio test or an exact test. Because the other methods are much more computationally intensive, we recommend our M1 method for quick and direct estimation of sample sizes in an experimental design. Finally, we illustrate sample size estimation using existing breast cancer RNA-seq data.
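
The over-dispersion that motivates the negative binomial model can be seen from its mean-variance relationship, variance = mu + dispersion * mu^2, which exceeds the Poisson variance mu whenever the dispersion is positive. The sketch below checks this by simulation with NumPy. It does not reproduce the sample size formulas of the study, and the helper names and parameter values are assumptions.

```python
import numpy as np

def nb_variance(mu, dispersion):
    """Variance of a negative binomial count with mean mu and dispersion phi."""
    return mu + dispersion * mu ** 2

def simulate_nb_counts(mu, dispersion, size, rng):
    """Draw counts with mean mu and variance mu + dispersion * mu**2."""
    n = 1.0 / dispersion            # NumPy's "number of successes" parameter
    p = n / (n + mu)                # NumPy's success probability
    return rng.negative_binomial(n, p, size=size)

rng = np.random.default_rng(0)
counts = simulate_nb_counts(mu=10.0, dispersion=0.5, size=200_000, rng=rng)
print(counts.mean(), counts.var(), nb_variance(10.0, 0.5))
```

With these illustrative values the theoretical variance is six times the mean, which is why a Poisson-based power calculation would understate the number of samples required.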


Citations: 7

MLML2R: an R package for maximum likelihood estimation of DNA methylation and hydroxymethylation proportions.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-01-17 DOI: 10.1515/sagmb-2018-0031

Samara F Kiihl, Maria Jose Martinez-Garrido, Arce Domingo-Relloso, Jose Bermudez, Maria Tellez-Plaza

Accurately measuring epigenetic marks such as 5-methylcytosine (5-mC) and 5-hydroxymethylcytosine (5-hmC) at the single-nucleotide level requires combining data from DNA processing methods, including traditional (BS), oxidative (oxBS) or Tet-assisted (TAB) bisulfite conversion. We introduce the R package MLML2R, which provides maximum likelihood estimates (MLE) of 5-mC and 5-hmC proportions. While all other available R packages provide 5-mC and 5-hmC MLEs only for the oxBS+BS combination, MLML2R also provides MLEs for TAB combinations. For combinations of any two of the methods, we derived the pool-adjacent-violators algorithm (PAVA) exact constrained MLE in analytical form. For the three-method combination, we implemented both the iterative method of Qu et al. [Qu, J., M. Zhou, Q. Song, E. E. Hong and A. D. Smith (2013): "MLML: consistent simultaneous estimates of DNA methylation and hydroxymethylation," Bioinformatics, 29, 2645-2646.] and a novel non-iterative approximation using Lagrange multipliers. The newly proposed non-iterative solutions greatly decrease computational time, a common bottleneck when processing high-throughput data. The MLML2R package is flexible, as it takes as input both preprocessed intensities from Infinium Methylation arrays and counts from next-generation sequencing technologies. The MLML2R package is freely available at https://CRAN.R-project.org/package=MLML2R.
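
The pooling idea behind a constrained estimate for the oxBS+BS combination can be sketched under a simple binomial count model, in which the BS signal estimates the 5-mC + 5-hmC proportion and the oxBS signal estimates the 5-mC proportion alone. When the naive difference would give a negative 5-hmC proportion, the two samples are pooled and 5-hmC is set to zero. This is an illustration written in Python, not the MLML2R interface, and the function and argument names are hypothetical.

```python
def constrained_mle_bs_oxbs(meth_bs, total_bs, meth_ox, total_ox):
    """Return (p_5mC, p_5hmC) from methylated/total counts at one CpG.

    Assumes independent binomial counts (an assumption of this sketch):
      BS   methylated fraction estimates 5-mC + 5-hmC
      oxBS methylated fraction estimates 5-mC only
    """
    p_bs = meth_bs / total_bs
    p_ox = meth_ox / total_ox
    if p_ox <= p_bs:
        # Order constraint satisfied: the unconstrained estimates are valid.
        return p_ox, p_bs - p_ox
    # Constraint violated: pool the two samples and set 5-hmC to zero.
    pooled = (meth_bs + meth_ox) / (total_bs + total_ox)
    return pooled, 0.0

print(constrained_mle_bs_oxbs(30, 100, 20, 100))   # constraint satisfied
print(constrained_mle_bs_oxbs(20, 100, 30, 100))   # violator: pooled estimate
```

Because the estimate is available in closed form, it can be applied to every CpG site without iteration, which is consistent with the computational advantage the abstract describes.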



Citations: 10