sc-OTGM: Single-Cell Perturbation Modeling by Solving Optimal Mass Transport on the Manifold of Gaussian Mixtures

Andac Demir, Elizaveta Solovyeva, James Boylan, Mei Xiao, Fabrizio Serluca, Sebastian Hoersch, Jeremy Jenkins, Murthy Devarakonda, Bulent Kiziltan
arXiv - QuanBio - Genomics. Published 2024-05-06. DOI: arxiv-2405.03726 (https://doi.org/arxiv-2405.03726)

Abstract

Influenced by breakthroughs in LLMs, single-cell foundation models are emerging. While these models perform well in cell type clustering, phenotype classification, and gene perturbation response prediction, it remains to be seen whether a simpler model could achieve comparable or better results, especially with limited data. This matters because the quantity and quality of single-cell data typically fall short of the standards of the textual data used for training LLMs. Single-cell sequencing often suffers from technical artifacts, dropout events, and batch effects. These challenges are compounded in a weakly supervised setting, where the labels of cell states can be noisy. To tackle these challenges, we present sc-OTGM, an efficient alternative with fewer than 500K parameters, which makes it approximately 100x more compact than the foundation models. sc-OTGM is an unsupervised model grounded in the inductive bias that scRNA-seq data can be generated from a finite combination of multivariate Gaussian distributions. The core function of sc-OTGM is to create a probabilistic latent space that uses a GMM as its prior distribution and to distinguish between distinct cell populations by learning their respective marginal PDFs. It uses a Hit-and-Run Markov chain sampler to determine the OT plan across these PDFs within the GMM framework. We evaluated our model on a CRISPR-mediated perturbation dataset, called CROP-seq, consisting of 57 one-gene perturbations. Our results demonstrate that sc-OTGM is effective in cell state classification, aids in the analysis of differential gene expression, and ranks genes for target identification through a recommender system. It also predicts the effects of single-gene perturbations on downstream gene regulation and generates synthetic scRNA-seq data conditioned on specific cell states.
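
To make the idea of transporting mass between Gaussian components concrete, here is a minimal sketch. It is not the authors' implementation and does not use the Hit-and-Run sampler mentioned in the abstract. It relies on the standard closed-form optimal transport solution between two Gaussians and assumes diagonal covariances to keep the arithmetic simple. The component names ("control", "perturbed"), the 2-D latent space, and all numeric values are hypothetical.

```python
def w2_squared_diag(mean_a, std_a, mean_b, std_b):
    """Squared 2-Wasserstein distance between two Gaussians with diagonal
    covariance: ||mean_a - mean_b||^2 + ||std_a - std_b||^2."""
    shift = sum((ma - mb) ** 2 for ma, mb in zip(mean_a, mean_b))
    spread = sum((sa - sb) ** 2 for sa, sb in zip(std_a, std_b))
    return shift + spread


def transport_map_diag(x, mean_a, std_a, mean_b, std_b):
    """Optimal (Monge) map that pushes N(mean_a, diag(std_a^2)) onto
    N(mean_b, diag(std_b^2)): rescale each coordinate, then shift."""
    return [
        mb + (sb / sa) * (xi - ma)
        for xi, ma, sa, mb, sb in zip(x, mean_a, std_a, mean_b, std_b)
    ]


# Hypothetical latent-space components (assumed values, not from the paper).
control_mean, control_std = [0.0, 0.0], [1.0, 2.0]
perturbed_mean, perturbed_std = [3.0, 4.0], [2.0, 2.0]

dist_sq = w2_squared_diag(control_mean, control_std,
                          perturbed_mean, perturbed_std)
moved = transport_map_diag([1.0, 2.0], control_mean, control_std,
                           perturbed_mean, perturbed_std)

print(dist_sq)  # 26.0
print(moved)    # [5.0, 6.0]
```

In this toy setting, the map sends a point from the control component to where it would sit in the perturbed component, and the distance summarizes how far the whole population has to move. With full covariance matrices the same closed form needs matrix square roots.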