Andac Demir, Elizaveta Solovyeva, James Boylan, Mei Xiao, Fabrizio Serluca, Sebastian Hoersch, Jeremy Jenkins, Murthy Devarakonda, Bulent Kiziltan
arXiv - QuanBio - Genomics, published 2024-05-06. DOI: arxiv-2405.03726 (https://doi.org/arxiv-2405.03726)
sc-OTGM: Single-Cell Perturbation Modeling by Solving Optimal Mass Transport on the Manifold of Gaussian Mixtures
Influenced by breakthroughs in LLMs, single-cell foundation models are
emerging. While these models show successful performance in cell type
clustering, phenotype classification, and gene perturbation response
prediction, it remains to be seen whether a simpler model could achieve comparable
or better results, especially with limited data. This matters because the
quantity and quality of single-cell data typically fall short of the standards
of the textual data used to train LLMs. Single-cell sequencing often suffers
from technical artifacts, dropout events, and batch effects. These challenges
are compounded in a weakly supervised setting, where the labels of cell states
can be noisy, further complicating the analysis. To tackle these challenges, we
present sc-OTGM, a streamlined model with fewer than 500K parameters, which makes it
approximately 100x more compact than the foundation models and an
efficient alternative. sc-OTGM is an unsupervised model grounded in the
inductive bias that scRNA-seq data can be generated from a finite mixture of
multivariate Gaussian distributions. The core function of sc-OTGM is
to create a probabilistic latent space that uses a Gaussian mixture model (GMM) as its prior
distribution and to distinguish between distinct cell populations by learning
their respective marginal probability density functions (PDFs). It uses a Hit-and-Run Markov chain sampler to
determine the optimal transport (OT) plan across these PDFs within the GMM framework. We evaluated
our model on a CRISPR-mediated perturbation dataset, called CROP-seq,
consisting of 57 one-gene perturbations. Our results demonstrate that sc-OTGM
is effective in cell state classification, aids in the analysis of differential
gene expression, and ranks genes for target identification through a
recommender system. It also predicts the effects of single-gene perturbations
on downstream gene regulation and generates synthetic scRNA-seq data
conditioned on specific cell states.
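
To make the idea of transporting mass between Gaussian cell-state distributions concrete, the following is a minimal sketch under stated assumptions. It is not the sc-OTGM method: the abstract says the model determines the OT plan with a Hit-and-Run Markov chain sampler, whereas this sketch uses the well-known closed-form solution for the special case of two single Gaussian components. The function names and the toy "control" and "perturbed" parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def sqrtm_psd(mat):
    """Symmetric square root of a positive semi-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_w2_squared(m1, s1, m2, s2):
    """Squared 2-Wasserstein distance between N(m1, s1) and N(m2, s2)."""
    s1_half = sqrtm_psd(s1)
    cross = sqrtm_psd(s1_half @ s2 @ s1_half)
    return float(np.sum((m1 - m2) ** 2) + np.trace(s1 + s2 - 2.0 * cross))


def gaussian_ot_map(m1, s1, m2, s2):
    """Affine map T(x) = m2 + A (x - m1) pushing N(m1, s1) onto N(m2, s2).

    Assumes s1 is strictly positive definite so that its square root is invertible.
    """
    s1_half = sqrtm_psd(s1)
    s1_half_inv = np.linalg.inv(s1_half)
    a = s1_half_inv @ sqrtm_psd(s1_half @ s2 @ s1_half) @ s1_half_inv
    # Rows of x are points; A is symmetric, so a.T equals a up to rounding.
    return lambda x: m2 + (x - m1) @ a.T


# Hypothetical 2-D latent components: a "control" and a "perturbed" cell state.
m_ctrl, s_ctrl = np.array([0.0, 0.0]), np.eye(2)
m_pert, s_pert = np.array([3.0, 4.0]), 4.0 * np.eye(2)

w2_sq = gaussian_w2_squared(m_ctrl, s_ctrl, m_pert, s_pert)
transport = gaussian_ot_map(m_ctrl, s_ctrl, m_pert, s_pert)
moved = transport(np.array([[0.0, 0.0], [1.0, 0.0]]))

print(round(w2_sq, 6))
print(moved)
```

In this isotropic toy case the squared distance reduces to the squared mean shift plus the dimension times the squared difference of standard deviations, 25 + 2 * (1 - 2)^2 = 27, and the map scales deviations from the control mean by a factor of 2 before shifting them to the perturbed mean. For full mixtures with several components per population no such single closed form applies, which is presumably where a sampling-based approach such as the one named in the abstract comes in.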