sc-OTGM: Single-Cell Perturbation Modeling by Solving Optimal Mass Transport on the Manifold of Gaussian Mixtures

Andac Demir, Elizaveta Solovyeva, James Boylan, Mei Xiao, Fabrizio Serluca, Sebastian Hoersch, Jeremy Jenkins, Murthy Devarakonda, Bulent Kiziltan
arXiv - QuanBio - Genomics, published 2024-05-06. DOI: arxiv-2405.03726 (https://doi.org/arxiv-2405.03726)
Citations: 0

Abstract

Influenced by breakthroughs in LLMs, single-cell foundation models are emerging. While these models perform well in cell type clustering, phenotype classification, and gene perturbation response prediction, it remains to be seen whether a simpler model could achieve comparable or better results, especially with limited data. This matters because the quantity and quality of single-cell data typically fall short of the textual data used to train LLMs. Single-cell sequencing often suffers from technical artifacts, dropout events, and batch effects, and these challenges are compounded in a weakly supervised setting, where cell-state labels can be noisy, further complicating the analysis. To tackle these challenges, we present sc-OTGM, an efficient alternative with fewer than 500K parameters, making it approximately 100x more compact than the foundation models. sc-OTGM is an unsupervised model grounded in the inductive bias that scRNA-seq data can be generated from a finite combination of multivariate Gaussian distributions. Its core function is to create a probabilistic latent space that uses a GMM as its prior distribution and to distinguish between distinct cell populations by learning their respective marginal PDFs. It uses a Hit-and-Run Markov chain sampler to determine the OT plan across these PDFs within the GMM framework. We evaluated our model on a CRISPR-mediated perturbation dataset, CROP-seq, consisting of 57 one-gene perturbations. Our results demonstrate that sc-OTGM is effective in cell state classification, aids the analysis of differential gene expression, and ranks genes for target identification through a recommender system. It also predicts the effects of single-gene perturbations on downstream gene regulation and generates synthetic scRNA-seq data conditioned on specific cell states.
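The abstract names a Hit-and-Run Markov chain sampler for the OT plan but gives no further detail. As a rough illustration of what transport between two Gaussian components involves, the sketch below evaluates the well-known closed-form squared 2-Wasserstein distance between two multivariate Gaussians. This is not the paper's procedure: the function names and the toy "control" and "perturbed" means and covariances are hypothetical and chosen only for illustration.

```python
import numpy as np


def psd_sqrt(mat):
    """Symmetric square root of a positive semidefinite matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T


def gaussian_w2_squared(mean_a, cov_a, mean_b, cov_b):
    """Squared 2-Wasserstein distance between N(mean_a, cov_a) and N(mean_b, cov_b).

    Closed form: ||mean_a - mean_b||^2
                 + tr(cov_a + cov_b - 2 (cov_a^{1/2} cov_b cov_a^{1/2})^{1/2}).
    """
    root_a = psd_sqrt(cov_a)
    cross = psd_sqrt(root_a @ cov_b @ root_a)
    mean_term = np.sum((mean_a - mean_b) ** 2)
    cov_term = np.trace(cov_a + cov_b - 2.0 * cross)
    return float(mean_term + cov_term)


# Hypothetical 2-D latent components for a control and a perturbed population.
control_mean = np.array([0.0, 0.0])
control_cov = np.eye(2)
perturbed_mean = np.array([3.0, 4.0])
perturbed_cov = 4.0 * np.eye(2)

w2_sq = gaussian_w2_squared(control_mean, control_cov, perturbed_mean, perturbed_cov)
print(w2_sq)
```

For these toy inputs the mean shift contributes 25 and the covariance mismatch contributes 2, so the printed value is 27 up to floating-point round-off. A full GMM-to-GMM transport plan would also have to decide how mass moves between components, which this single-pair sketch does not attempt.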