Alternating EM algorithm for a bilinear model in isoform quantification from RNA-seq data.

Bioinformatics (IF 5.4; Zone 3, Biology; JCR Q1, Biochemical Research Methods) Pub Date: 2020-02-01 DOI: 10.1093/bioinformatics/btz640

Wenjiang Deng, Tian Mou, Krishna R Kalari, Nifang Niu, Liewei Wang, Yudi Pawitan, Trung Nghia Vu

Bioinformatics, volume 36, issue 3, pages 805-812. Publication type: Journal Article.

Citations: 9

Abstract

Motivation: Estimation of isoform-level gene expression from RNA-seq data depends on simplifying assumptions, such as uniform read distribution, that are easily violated in real data. Such violations typically lead to biased estimates. Most existing methods provide bias-correction steps that are based on biological considerations, such as GC content, and are applied to each sample separately. The main problem is that not all biases are known.

Results: We have developed a novel method called XAEM, based on a more flexible and robust statistical model. Existing methods are essentially based on a linear model Xβ, where the design matrix X is known and is computed from the simplifying assumptions. In contrast, XAEM treats Xβ as a bilinear model in which both X and β are unknown. Joint estimation of X and β is made possible by simultaneous analysis of multi-sample RNA-seq data. Compared to existing methods, XAEM automatically performs empirical correction of potentially unknown biases. We use an alternating expectation-maximization (AEM) algorithm, alternating between estimation of X and β. For speed, XAEM uses quasi-mapping for read alignment, which leads to a fast algorithm. Overall, XAEM performs favorably compared to recent advanced methods. For simulated datasets, XAEM obtains higher accuracy for multiple-isoform genes. In a differential-expression analysis of a real single-cell RNA-seq dataset, XAEM achieves substantially better rediscovery rates in independent validation sets.
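The alternating idea can be illustrated with a minimal sketch. This is not the XAEM implementation, and the abstract does not specify the likelihood, constraints, or update formulas. The sketch assumes that read counts Y (features x samples) are Poisson with mean X @ B, that each column of X sums to 1, and that B holds one column of β per sample. Under those assumptions, the EM updates are the standard multiplicative ones for a Poisson linear model. All function names and the simulated data are hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log(0)

def poisson_loglik(Y, X, B):
    """Poisson log-likelihood of counts Y under mean X @ B (constant terms dropped)."""
    M = X @ B + EPS
    return float(np.sum(Y * np.log(M) - M))

def update_beta(Y, X, B):
    """One EM step for B (isoforms x samples) with the design matrix X held fixed."""
    M = X @ B + EPS
    return B * (X.T @ (Y / M)) / (X.sum(axis=0)[:, None] + EPS)

def update_design(Y, X, B):
    """One EM step for X (features x isoforms) with B held fixed, pooling all samples."""
    M = X @ B + EPS
    return X * ((Y / M) @ B.T) / (B.sum(axis=1)[None, :] + EPS)

def alternating_em(Y, X0, n_iter=100):
    """Alternate between the B step and the X step (assumed scheme, for illustration)."""
    X = X0 / X0.sum(axis=0, keepdims=True)      # columns of X sum to 1
    k = X.shape[1]
    B = np.tile(Y.sum(axis=0) / k, (k, 1))      # positive starting values
    history = []
    for _ in range(n_iter):
        B = update_beta(Y, X, B)
        X = update_design(Y, X, B)
        # Renormalize the columns of X and move the scale into B,
        # which leaves the product X @ B unchanged.
        scale = X.sum(axis=0) + EPS
        X = X / scale
        B = B * scale[:, None]
        history.append(poisson_loglik(Y, X, B))
    return X, B, history

# Simulated example: 4 read-count features, 2 isoforms, 30 samples (all made up).
rng = np.random.default_rng(0)
X_true = np.array([[0.5, 0.0],
                   [0.3, 0.2],
                   [0.2, 0.3],
                   [0.0, 0.5]])
B_true = rng.gamma(shape=2.0, scale=200.0, size=(2, 30))
Y = rng.poisson(X_true @ B_true)

# Start from a "uniform" design: equal weight on every compatible feature.
X0 = (X_true > 0).astype(float)
X_hat, B_hat, history = alternating_em(Y, X0)
```

In this sketch the zero pattern of the starting design is preserved, because the updates are multiplicative. The recorded log-likelihood should be non-decreasing up to rounding, since each half-step is an EM update and the rescaling does not change X @ B. Pooling several samples is what makes the X step informative: with a single sample, X and β could not be separated.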

Availability and implementation: The method and pipeline are implemented as a tool and freely available for use at http://fafner.meb.ki.se/biostatwiki/xaem/.

Supplementary information: Supplementary data are available at Bioinformatics online.


Source journal: Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 11.20

Self-citation rate: 5.20%

Articles published: 753

Review time: 2.1 months

About the journal: The leading journal in its field, Bioinformatics publishes the highest quality scientific papers and review articles of interest to academic and industrial researchers. Its main focus is on new developments in genome bioinformatics and computational biology. Two distinct sections within the journal, Discovery Notes and Application Notes, focus on shorter papers; the former reports biologically interesting discoveries made using computational methods, and the latter explores the applications used for experiments.