Compound models and Pearson residuals for single-cell RNA-seq data without UMIs.

bioRxiv: the preprint server for biology. Pub Date: 2024-07-25. DOI: 10.1101/2023.08.02.551637

Jan Lause, Christoph Ziegenhain, Leonard Hartmanis, Philipp Berens, Dmitry Kobak

Citations: 0

Abstract

Recent work employed Pearson residuals from Poisson or negative binomial models to normalize UMI data. To extend this approach to non-UMI data, we model the additional amplification step with a compound distribution: we assume that sequenced RNA molecules follow a negative binomial distribution, and are then replicated following an amplification distribution. We show how this model leads to compound Pearson residuals, which yield meaningful gene selection and embeddings of Smart-seq2 datasets. Further, we suggest that amplification distributions across several sequencing protocols can be described by a broken power law. The resulting compound model captures previously unexplained overdispersion and zero-inflation patterns in non-UMI data.
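The compound construction described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation: it assumes a geometric amplification distribution purely for convenience (the abstract suggests a broken power law instead), and all function names are hypothetical. The variance used for the residuals follows from the law of total variance: if molecule counts have mean mu and variance mu + mu^2/theta, and each molecule is copied Z times independently, then the read counts have mean m = mu * E[Z] and variance alpha * m + m^2/theta with alpha = E[Z^2]/E[Z]. The preprint's exact parameterization may differ.

```python
import numpy as np


def simulate_compound_counts(mu, theta, amp_p, n_cells, rng):
    """Simulate read counts as amplified negative binomial molecule counts."""
    # Molecules: negative binomial with mean mu and variance mu + mu**2 / theta.
    molecules = rng.negative_binomial(theta, theta / (theta + mu), size=n_cells)
    # Amplification (an assumption of this sketch): each molecule yields
    # Z >= 1 reads with Z ~ Geometric(amp_p). The sum of k such copies equals
    # k plus a negative binomial number of "failures", so no loop is needed.
    reads = np.zeros(n_cells, dtype=np.int64)
    expressed = molecules > 0
    reads[expressed] = molecules[expressed] + rng.negative_binomial(
        molecules[expressed], amp_p
    )
    return molecules, reads


def geometric_alpha(amp_p):
    """E[Z^2] / E[Z] for Z ~ Geometric(amp_p) on {1, 2, ...}."""
    return (2.0 - amp_p) / amp_p


def compound_variance(mean_reads, theta, alpha):
    """Read-count variance of the compound model (law of total variance)."""
    return alpha * mean_reads + mean_reads**2 / theta


def compound_pearson_residuals(reads, mean_reads, theta, alpha):
    """Center by the model mean and scale by the compound standard deviation."""
    return (reads - mean_reads) / np.sqrt(
        compound_variance(mean_reads, theta, alpha)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu, theta, amp_p = 5.0, 10.0, 0.1  # illustrative values, not from the paper
    _, reads = simulate_compound_counts(mu, theta, amp_p, 200_000, rng)
    mean_reads = mu / amp_p  # E[Z] = 1 / amp_p for the geometric distribution
    alpha = geometric_alpha(amp_p)
    residuals = compound_pearson_residuals(reads, mean_reads, theta, alpha)
    print("empirical variance:", reads.var())
    print("model variance:    ", compound_variance(mean_reads, theta, alpha))
    print("residual variance: ", residuals.var())
```

Here the true mean is known because the data are simulated. On real data the per-gene, per-cell mean would have to be estimated, and alpha and theta would be chosen or fitted; those steps are outside the scope of this sketch.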


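Compound models and Pearson residuals for normalization of single-cell RNA-seq data without UMIs.

Before downstream analysis can reveal biological signals in single-cell RNA sequencing data, normalization and variance stabilization are required to remove technical noise. Recently, Pearson residuals based on negative binomial models have been suggested as an efficient normalization approach. These methods were developed for UMI-based sequencing protocols, where unique molecular identifiers (UMIs) help to remove PCR amplification noise by keeping track of the original molecules. In contrast, full-length protocols such as Smart-seq2 lack UMIs and retain amplification noise, making negative binomial models inapplicable. Here, we extend Pearson residuals to such read count data by modeling them as a compound process: we assume that the captured RNA molecules follow a negative binomial distribution, but are replicated according to an amplification distribution. Based on this model, we introduce compound Pearson residuals and show that they can be obtained analytically without explicit knowledge of the amplification distribution. Further, we demonstrate that compound Pearson residuals lead to biologically meaningful gene selection and low-dimensional embeddings of complex Smart-seq2 datasets. Finally, we empirically study amplification distributions across several sequencing protocols and suggest that they can be described by a broken power law. We show that the resulting compound distribution captures the overdispersion and zero-inflation patterns characteristic of read count data. In summary, compound Pearson residuals provide an efficient way to normalize read count data based on simple mechanistic assumptions.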
