Alternating EM algorithm for a bilinear model in isoform quantification from RNA-seq data.

IF 5.4 · Zone 3, Biology · Q1, Biochemical Research Methods · Bioinformatics · Pub Date: 2020-02-01 · DOI: 10.1093/bioinformatics/btz640
Wenjiang Deng, Tian Mou, Krishna R Kalari, Nifang Niu, Liewei Wang, Yudi Pawitan, Trung Nghia Vu
Bioinformatics, volume 36, issue 3, pages 805-812.
Citations: 9

Abstract

Motivation: Estimation of isoform-level gene expression from RNA-seq data depends on simplifying assumptions, such as a uniform read distribution, that are easily violated in real data. Such violations typically lead to biased estimates. Most existing methods provide bias-correction steps that are based on biological considerations, such as GC content, and are applied to each sample separately. The main problem is that not all biases are known.

Results: We have developed a novel method called XAEM based on a more flexible and robust statistical model. Existing methods are essentially based on a linear model Xβ, where the design matrix X is known and is computed from the simplifying assumptions. In contrast, XAEM treats Xβ as a bilinear model in which both X and β are unknown. Joint estimation of X and β is made possible by simultaneous analysis of multi-sample RNA-seq data. Compared to existing methods, XAEM automatically performs empirical correction of potentially unknown biases. We use an alternating expectation-maximization (AEM) algorithm that alternates between estimating X and estimating β. For speed, XAEM uses quasi-mapping for read alignment. Overall, XAEM performs favorably compared to recent advanced methods. On simulated datasets, XAEM obtains higher accuracy for multiple-isoform genes. In a differential-expression analysis of a real single-cell RNA-seq dataset, XAEM achieves substantially better rediscovery rates in independent validation sets.
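
The alternation between X and β can be illustrated with a small sketch. This is not the XAEM implementation. It assumes a count matrix Y (read classes by samples) with Poisson mean Xβ, non-negative X (read classes by isoforms) and β (isoforms by samples), and it uses the standard multiplicative EM update for a Poisson linear model in each half-step. The function names `poisson_loglik` and `alternating_em`, the starting values, and the column normalization of X (used here only to fix the scale shared between X and β) are all assumptions made for illustration.

```python
import numpy as np


def poisson_loglik(Y, X, beta, eps=1e-12):
    """Poisson log-likelihood of Y under mean X @ beta (constant terms dropped)."""
    mu = X @ beta + eps
    return float(np.sum(Y * np.log(mu) - mu))


def alternating_em(Y, X0, n_iter=200, eps=1e-12):
    """Alternate EM-style updates of beta (X fixed) and X (beta fixed).

    Y  : (n_classes, n_samples) non-negative read counts (assumed layout)
    X0 : (n_classes, n_isoforms) non-negative starting design matrix
    Returns the estimates and the log-likelihood after each iteration.
    """
    Y = np.asarray(Y, dtype=float)
    X = np.array(X0, dtype=float)
    X = X / X.sum(axis=0, keepdims=True)  # columns sum to 1
    n_iso = X.shape[1]
    # Start by splitting each sample's total count evenly across isoforms.
    beta = np.tile(Y.sum(axis=0, keepdims=True) / n_iso, (n_iso, 1))
    history = []
    for _ in range(n_iter):
        # beta-step: EM update for the Poisson linear model with X held fixed.
        ratio = Y / (X @ beta + eps)
        beta = beta * (X.T @ ratio) / (X.sum(axis=0)[:, None] + eps)
        # X-step: the same update with the roles of X and beta exchanged.
        ratio = Y / (X @ beta + eps)
        X = X * (ratio @ beta.T) / (beta.sum(axis=1)[None, :] + eps)
        # Renormalize columns of X; rescaling beta keeps X @ beta unchanged.
        scale = X.sum(axis=0, keepdims=True) + eps
        X = X / scale
        beta = beta * scale.T
        history.append(poisson_loglik(Y, X, beta))
    return X, beta, history
```

Because X and β enter only through their product, some constraint is needed to separate them; the column normalization above is one simple choice, and sharing a single X across many samples is what makes it estimable at all, which mirrors the abstract's point about multi-sample analysis.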

Availability and implementation: The method and pipeline are implemented as a tool and freely available for use at http://fafner.meb.ki.se/biostatwiki/xaem/.

Supplementary information: Supplementary data are available at Bioinformatics online.
