sc-OTGM: Single-Cell Perturbation Modeling by Solving Optimal Mass Transport on the Manifold of Gaussian Mixtures

arXiv - QuanBio - Genomics Pub Date : 2024-05-06 DOI:arxiv-2405.03726

Andac Demir, Elizaveta Solovyeva, James Boylan, Mei Xiao, Fabrizio Serluca, Sebastian Hoersch, Jeremy Jenkins, Murthy Devarakonda, Bulent Kiziltan



Abstract

Influenced by breakthroughs in LLMs, single-cell foundation models are emerging. While these models perform well in cell type clustering, phenotype classification, and gene perturbation response prediction, it remains to be seen whether a simpler model could achieve comparable or better results, especially with limited data. This matters because the quantity and quality of single-cell data typically fall short of the standards of the textual data used to train LLMs. Single-cell sequencing often suffers from technical artifacts, dropout events, and batch effects. These challenges are compounded in a weakly supervised setting, where the labels of cell states can be noisy, further complicating the analysis. To tackle these challenges, we present sc-OTGM, a streamlined model with fewer than 500K parameters, making it approximately 100x more compact than the foundation models and offering an efficient alternative. sc-OTGM is an unsupervised model grounded in the inductive bias that scRNA-seq data can be generated from a finite mixture of multivariate Gaussian distributions. The core function of sc-OTGM is to create a probabilistic latent space that uses a GMM as its prior distribution and to distinguish between distinct cell populations by learning their respective marginal PDFs. It uses a Hit-and-Run Markov chain sampler to determine the OT plan across these PDFs within the GMM framework. We evaluated our model on a CRISPR-mediated perturbation dataset, called CROP-seq, consisting of 57 one-gene perturbations. Our results demonstrate that sc-OTGM is effective in cell state classification, aids in the analysis of differential gene expression, and ranks genes for target identification through a recommender system. It also predicts the effects of single-gene perturbations on downstream gene regulation and generates synthetic scRNA-seq data conditioned on specific cell states.
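The two ingredients named in the abstract, a Gaussian mixture over a latent space and optimal transport between its components, can be illustrated with a small sketch. This is not the authors' implementation: the abstract states that sc-OTGM determines the OT plan with a Hit-and-Run Markov chain sampler, whereas the sketch below substitutes the well-known closed-form 2-Wasserstein solution for Gaussians, restricted to diagonal covariances to keep it dependency-free. The two-dimensional latent space, the "control" and "perturbed" components, and all function names are hypothetical choices made for illustration.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def responsibilities(x, weights, means, variances):
    """Posterior probability of each mixture component given x."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the max for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def w2_squared_diag(mean_a, var_a, mean_b, var_b):
    """Squared 2-Wasserstein distance between two diagonal Gaussians."""
    shift = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    spread = sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_a, var_b)
    )
    return shift + spread

def transport(x, mean_a, var_a, mean_b, var_b):
    """Optimal map pushing N(mean_a, var_a) onto N(mean_b, var_b)."""
    return [
        mb + math.sqrt(vb / va) * (xi - ma)
        for xi, ma, va, mb, vb in zip(x, mean_a, var_a, mean_b, var_b)
    ]

# Hypothetical 2-D latent space (assumption, not taken from the paper):
# component 0 plays the role of "control", component 1 of "perturbed".
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 4.0]]
variances = [[1.0, 1.0], [4.0, 1.0]]

cell = [0.2, -0.1]  # latent coordinates of one hypothetical cell
post = responsibilities(cell, weights, means, variances)
dist2 = w2_squared_diag(means[0], variances[0], means[1], variances[1])
moved = transport(cell, means[0], variances[0], means[1], variances[1])

print("component posteriors:", post)
print("squared W2 between components:", dist2)
print("cell after transport:", moved)
```

In this toy setting, `responsibilities` stands in for cell state classification, `w2_squared_diag` measures how far apart two cell populations are, and `transport` moves a cell from one population toward the other. With full covariance matrices the closed form requires matrix square roots, and the sampler-based approach described in the abstract is a different procedure altogether.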


