$\Gamma$-VAE: Curvature regularized variational autoencoders for uncovering emergent low dimensional geometric structure in high dimensional data
Jason Z. Kim, Nicolas Perrin-Gilbert, Erkan Narmanli, Paul Klein, Christopher R. Myers, Itai Cohen, Joshua J. Waterfall, James P. Sethna
arXiv - QuanBio - Genomics, 2024-03-02, arXiv:2403.01078
Abstract
Natural systems with emergent behaviors often organize along low-dimensional
subsets of high-dimensional spaces. For example, despite the tens of thousands
of genes in the human genome, the principled study of genomics is fruitful
because biological processes rely on coordinated organization that results in
lower-dimensional phenotypes. To uncover this organization, many nonlinear
dimensionality reduction techniques have successfully embedded high-dimensional
data into low-dimensional spaces by preserving local similarities between data
points. However, the nonlinearities in these methods allow for too much
curvature to preserve general trends across multiple non-neighboring data
clusters, thereby limiting their interpretability and generalizability to
out-of-distribution data. Here, we address both of these limitations by
regularizing the curvature of manifolds generated by variational autoencoders,
a process we term ``$\Gamma$-VAE''. We demonstrate its utility using two
example data sets: bulk RNA-seq from The Cancer Genome Atlas (TCGA) and the
Genotype-Tissue Expression (GTEx) project; and single-cell RNA-seq from a lineage
tracing experiment in hematopoietic stem cell differentiation. We find that the
resulting regularized manifolds identify mesoscale structure associated with
different cancer cell types, and accurately re-embed tissues from completely
unseen, out-of-distribution cancers as if the model had originally been trained on them.
Finally, we show that preserving long-range relationships to differentiated
cells separates undifferentiated cells -- which have not yet specialized --
according to their eventual fate. Broadly, we anticipate that regularizing the
curvature of generative models will enable more consistent, predictive, and
generalizable models in any high-dimensional system with emergent
low-dimensional behavior.
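
The abstract does not say how curvature is measured or penalized, so the following is only a rough illustration of the general idea, not the authors' method. It assumes a simple stand-in for curvature: the squared second directional derivative of a decoder, estimated by central finite differences along random latent directions. The names `curvature_penalty`, `flat_decoder`, and `curved_decoder` are hypothetical, and the toy decoders are plain NumPy functions rather than trained networks.

```python
import numpy as np

def curvature_penalty(decoder, z, h=1e-2, n_dirs=8, seed=0):
    """Mean squared second directional derivative of `decoder` at latent points `z`.

    ASSUMPTION: this finite-difference proxy is for illustration only; it is
    not the curvature measure defined in the paper.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_dirs):
        # Random unit direction for each latent point.
        u = rng.normal(size=z.shape)
        u /= np.linalg.norm(u, axis=1, keepdims=True)
        # Central second difference approximates d^2/dt^2 decoder(z + t*u) at t=0.
        second = (decoder(z + h * u) - 2.0 * decoder(z) + decoder(z - h * u)) / h**2
        total += np.mean(np.sum(second**2, axis=1))
    return total / n_dirs

# Toy setup: a 2-D latent space mapped to 50 "genes" (hypothetical sizes).
rng = np.random.default_rng(1)
W = rng.normal(size=(2, 50))
z = rng.normal(size=(16, 2))

def flat_decoder(z):
    # Affine map: its image is a flat plane, so the penalty should vanish.
    return z @ W + 1.0

def curved_decoder(z):
    # Nonlinear map: its image bends, so the penalty should be positive.
    return np.tanh(z @ W)

flat_penalty = curvature_penalty(flat_decoder, z)
curved_penalty = curvature_penalty(curved_decoder, z)
print(flat_penalty, curved_penalty)
```

In a training loop, a term of this kind would presumably be added to the usual VAE objective with a tunable weight, pushing the decoder toward flatter manifolds; that weighting scheme is likewise an assumption here.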