C-ziptf：零膨胀多维基因组学数据的稳定张量因式分解。

IF 2.9 3区生物学 Q2 BIOCHEMICAL RESEARCH METHODS BMC Bioinformatics Pub Date : 2024-10-05 DOI:10.1186/s12859-024-05886-4

Daniel Chafamo, Vignesh Shanmugam, Neriman Tokcan

{"title":"C-ziptf：零膨胀多维基因组学数据的稳定张量因式分解。","authors":"Daniel Chafamo, Vignesh Shanmugam, Neriman Tokcan","doi":"10.1186/s12859-024-05886-4","DOIUrl":null,"url":null,"abstract":"In the past two decades, genomics has advanced significantly, with single-cell RNA-sequencing (scRNA-seq) marking a pivotal milestone. ScRNA-seq provides unparalleled insights into cellular diversity and has spurred diverse studies across multiple conditions and samples, resulting in an influx of complex multidimensional genomics data. This highlights the need for robust methodologies capable of handling the complexity and multidimensionality of such genomics data. Furthermore, single-cell data grapples with sparsity due to issues like low capture efficiency and dropout effects. Tensor factorizations (TF) have emerged as powerful tools to unravel the complex patterns from multi-dimensional genomics data. Classic TF methods, based on maximum likelihood estimation, struggle with zero-inflated count data, while the inherent stochasticity in TFs further complicates result interpretation and reproducibility. Our paper introduces Zero Inflated Poisson Tensor Factorization (ZIPTF), a novel method for high-dimensional zero-inflated count data factorization. We also present Consensus-ZIPTF (C-ZIPTF), merging ZIPTF with a consensus-based approach to address stochasticity. We evaluate our proposed methods on synthetic zero-inflated count data, simulated scRNA-seq data, and real multi-sample multi-condition scRNA-seq datasets. ZIPTF consistently outperforms baseline matrix and tensor factorization methods, displaying enhanced reconstruction accuracy for zero-inflated data. When dealing with high probabilities of excess zeros, ZIPTF achieves up to <math><mrow><mn>2.4</mn> <mo>×</mo></mrow> </math> better accuracy. Moreover, C-ZIPTF notably enhances the factorization's consistency. When tested on synthetic and real scRNA-seq data, ZIPTF and C-ZIPTF consistently uncover known and biologically meaningful gene expression programs. Access our data and code at: https://github.com/klarman-cell-observatory/scBTF and https://github.com/klarman-cell-observatory/scbtf_experiments .","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9000,"publicationDate":"2024-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11456250/pdf/","citationCount":"0","resultStr":"{\"title\":\"C-ziptf: stable tensor factorization for zero-inflated multi-dimensional genomics data.\",\"authors\":\"Daniel Chafamo, Vignesh Shanmugam, Neriman Tokcan\",\"doi\":\"10.1186/s12859-024-05886-4\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In the past two decades, genomics has advanced significantly, with single-cell RNA-sequencing (scRNA-seq) marking a pivotal milestone. ScRNA-seq provides unparalleled insights into cellular diversity and has spurred diverse studies across multiple conditions and samples, resulting in an influx of complex multidimensional genomics data. This highlights the need for robust methodologies capable of handling the complexity and multidimensionality of such genomics data. Furthermore, single-cell data grapples with sparsity due to issues like low capture efficiency and dropout effects. Tensor factorizations (TF) have emerged as powerful tools to unravel the complex patterns from multi-dimensional genomics data. Classic TF methods, based on maximum likelihood estimation, struggle with zero-inflated count data, while the inherent stochasticity in TFs further complicates result interpretation and reproducibility. Our paper introduces Zero Inflated Poisson Tensor Factorization (ZIPTF), a novel method for high-dimensional zero-inflated count data factorization. We also present Consensus-ZIPTF (C-ZIPTF), merging ZIPTF with a consensus-based approach to address stochasticity. We evaluate our proposed methods on synthetic zero-inflated count data, simulated scRNA-seq data, and real multi-sample multi-condition scRNA-seq datasets. ZIPTF consistently outperforms baseline matrix and tensor factorization methods, displaying enhanced reconstruction accuracy for zero-inflated data. When dealing with high probabilities of excess zeros, ZIPTF achieves up to <math><mrow><mn>2.4</mn> <mo>×</mo></mrow> </math> better accuracy. Moreover, C-ZIPTF notably enhances the factorization's consistency. When tested on synthetic and real scRNA-seq data, ZIPTF and C-ZIPTF consistently uncover known and biologically meaningful gene expression programs. Access our data and code at: https://github.com/klarman-cell-observatory/scBTF and https://github.com/klarman-cell-observatory/scbtf_experiments .\",\"PeriodicalId\":8958,\"journal\":{\"name\":\"BMC Bioinformatics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2024-10-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11456250/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Bioinformatics\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1186/s12859-024-05886-4\",\"RegionNum\":3,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"BIOCHEMICAL RESEARCH METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s12859-024-05886-4","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

摘要

在过去的二十年里，基因组学取得了长足的进步，其中单细胞 RNA 测序（scRNA-seq）是一个重要的里程碑。ScRNA-seq 提供了对细胞多样性的无与伦比的洞察力，并促进了跨越多种条件和样本的多样化研究，从而产生了大量复杂的多维基因组学数据。这凸显了对能够处理此类基因组学数据的复杂性和多维性的稳健方法的需求。此外，单细胞数据还因捕获效率低和丢失效应等问题而面临稀疏性问题。张量因子化（TF）已成为从多维基因组学数据中揭示复杂模式的强大工具。基于最大似然估计的经典张量因式分解方法在处理零膨胀计数数据时非常吃力，而张量因式分解固有的随机性使结果解释和可重复性变得更加复杂。我们的论文介绍了零膨胀泊松张量因式分解（ZIPTF），这是一种用于高维零膨胀计数数据因式分解的新方法。我们还介绍了共识-ZIPTF（Consensus-ZIPTF），它将 ZIPTF 与基于共识的方法合并，以解决随机性问题。我们在合成的零膨胀计数数据、模拟的 scRNA-seq 数据和真实的多样本多条件 scRNA-seq 数据集上评估了我们提出的方法。ZIPTF 始终优于基线矩阵和张量因式分解方法，在零膨胀数据方面显示出更高的重建精度。在处理高概率的过零数据时，ZIPTF 的准确度最高可提高 2.4 倍。此外，C-ZIPTF 还显著增强了因式分解的一致性。在合成和真实 scRNA-seq 数据上进行测试时，ZIPTF 和 C-ZIPTF 始终能发现已知的、具有生物学意义的基因表达程序。访问我们的数据和代码：https://github.com/klarman-cell-observatory/scBTF 和 https://github.com/klarman-cell-observatory/scbtf_experiments 。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

C-ziptf: stable tensor factorization for zero-inflated multi-dimensional genomics data.

In the past two decades, genomics has advanced significantly, with single-cell RNA-sequencing (scRNA-seq) marking a pivotal milestone. ScRNA-seq provides unparalleled insights into cellular diversity and has spurred diverse studies across multiple conditions and samples, resulting in an influx of complex multidimensional genomics data. This highlights the need for robust methodologies capable of handling the complexity and multidimensionality of such genomics data. Furthermore, single-cell data grapples with sparsity due to issues like low capture efficiency and dropout effects. Tensor factorizations (TF) have emerged as powerful tools to unravel the complex patterns from multi-dimensional genomics data. Classic TF methods, based on maximum likelihood estimation, struggle with zero-inflated count data, while the inherent stochasticity in TFs further complicates result interpretation and reproducibility. Our paper introduces Zero Inflated Poisson Tensor Factorization (ZIPTF), a novel method for high-dimensional zero-inflated count data factorization. We also present Consensus-ZIPTF (C-ZIPTF), merging ZIPTF with a consensus-based approach to address stochasticity. We evaluate our proposed methods on synthetic zero-inflated count data, simulated scRNA-seq data, and real multi-sample multi-condition scRNA-seq datasets. ZIPTF consistently outperforms baseline matrix and tensor factorization methods, displaying enhanced reconstruction accuracy for zero-inflated data. When dealing with high probabilities of excess zeros, ZIPTF achieves up to $2.4 \times$ better accuracy. Moreover, C-ZIPTF notably enhances the factorization's consistency. When tested on synthetic and real scRNA-seq data, ZIPTF and C-ZIPTF consistently uncover known and biologically meaningful gene expression programs. Access our data and code at: https://github.com/klarman-cell-observatory/scBTF and https://github.com/klarman-cell-observatory/scbtf_experiments .

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

BMC Bioinformatics 生物-生化研究方法

CiteScore

5.70

自引率

3.30%

发文量

506

审稿时长

4.3 months

期刊介绍： BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.