Andac Demir, Elizaveta Solovyeva, James Boylan, Mei Xiao, Fabrizio Serluca, Sebastian Hoersch, Jeremy Jenkins, Murthy Devarakonda, Bulent Kiziltan
sc-OTGM: Single-Cell Perturbation Modeling by Solving Optimal Mass Transport on the Manifold of Gaussian Mixtures
arXiv - QuanBio - Genomics, published 2024-05-06. https://doi.org/arxiv-2405.03726
Abstract
Influenced by breakthroughs in LLMs, single-cell foundation models are
emerging. While these models show successful performance in cell type
clustering, phenotype classification, and gene perturbation response
prediction, it remains to be seen whether a simpler model could achieve comparable
or better results, especially with limited data. This matters because the
quantity and quality of single-cell data typically fall short of those of the
textual data used for training LLMs. Single-cell sequencing often suffers
from technical artifacts, dropout events, and batch effects. These challenges
are compounded in a weakly supervised setting, where the labels of cell states
can be noisy, further complicating the analysis. To tackle these challenges, we
present sc-OTGM, a streamlined model with fewer than 500K parameters, which makes it
approximately 100x more compact than the foundation models and an
efficient alternative. sc-OTGM is an unsupervised model grounded in the
inductive bias that scRNA-seq data can be generated from a finite mixture of
multivariate Gaussian distributions. The core function of sc-OTGM is
to create a probabilistic latent space utilizing a GMM as its prior
distribution and distinguish between distinct cell populations by learning
their respective marginal PDFs. It uses a Hit-and-Run Markov chain sampler to
determine the OT plan across these PDFs within the GMM framework. We evaluated
our model on a CRISPR-mediated perturbation dataset, called CROP-seq,
consisting of 57 single-gene perturbations. Our results demonstrate that sc-OTGM
is effective in cell state classification, aids in the analysis of differential
gene expression, and ranks genes for target identification through a
recommender system. It also predicts the effects of single-gene perturbations
on downstream gene regulation and generates synthetic scRNA-seq data
conditioned on specific cell states.
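
To make the idea of optimal transport between Gaussian components concrete, the sketch below computes the squared 2-Wasserstein distance and the transport map between two Gaussians with diagonal covariance, where both have a well-known closed form. This is only an illustration under simplifying assumptions: it is not the Hit-and-Run Markov chain sampler the abstract describes, and the function names, the two-dimensional latent space, and the "control" and "perturbed" component parameters are all hypothetical.

```python
import math


def w2_squared_diag(mean_a, var_a, mean_b, var_b):
    """Squared 2-Wasserstein distance between two diagonal-covariance Gaussians."""
    # Cost of moving the mean.
    shift = sum((ma - mb) ** 2 for ma, mb in zip(mean_a, mean_b))
    # Cost of reshaping the per-dimension standard deviations.
    spread = sum(
        (math.sqrt(va) - math.sqrt(vb)) ** 2 for va, vb in zip(var_a, var_b)
    )
    return shift + spread


def transport_diag(x, mean_a, var_a, mean_b, var_b):
    """Map a point drawn from Gaussian A to its optimal-transport image under Gaussian B."""
    # For diagonal covariances the map rescales each coordinate independently.
    return [
        mb + math.sqrt(vb / va) * (xi - ma)
        for xi, ma, va, mb, vb in zip(x, mean_a, var_a, mean_b, var_b)
    ]


# Hypothetical latent-space components (illustrative values, not from the paper).
control_mean, control_var = [0.0, 0.0], [1.0, 4.0]
perturbed_mean, perturbed_var = [3.0, 4.0], [4.0, 1.0]

dist_sq = w2_squared_diag(control_mean, control_var, perturbed_mean, perturbed_var)
moved = transport_diag(
    [1.0, 2.0], control_mean, control_var, perturbed_mean, perturbed_var
)
print(dist_sq)  # 27.0
print(moved)    # [5.0, 5.0]
```

In this toy setting, the distance splits into a mean-shift term (25) and a covariance term (2). The map sends the control mean exactly onto the perturbed mean. Full covariance matrices would require a matrix square root in place of the per-dimension one used here.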