Spatially resolved transcriptomics represents a significant advancement in single-cell analysis by offering both gene expression data and their corresponding physical locations. However, this high degree of spatial resolution comes with a drawback: the resulting spatial transcriptomic data at the cellular level has a high incidence of missing values. Furthermore, most existing imputation methods either overlook the spatial information between spots or compromise the overall gene expression data distribution. To address these challenges, we focus on effectively utilizing the spatial location information within spatial transcriptomic data to impute missing values while preserving the overall data distribution. We introduce stMCDI, a novel conditional diffusion model for spatial transcriptomics data imputation, which employs a denoising network trained using randomly masked data portions as guidance, with the unmasked data serving as conditions. Additionally, it utilizes a GNN encoder to integrate the spatial position information, thereby enhancing model performance. The results obtained from spatial transcriptomics datasets demonstrate the performance of our method relative to existing approaches.
{"title":"stMCDI: Masked Conditional Diffusion Model with Graph Neural Network for Spatial Transcriptomics Data Imputation","authors":"Xiaoyu Li, Wenwen Min, Shunfang Wang, Changmiao Wang, Taosheng Xu","doi":"arxiv-2403.10863","DOIUrl":"https://doi.org/arxiv-2403.10863","journal":{"name":"arXiv - QuanBio - Genomics","volume":"120 1"},"publicationDate":"2024-03-16","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140169208"}
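The masking idea described in the stMCDI abstract (hide a random portion of the observed values, then condition on what remains) can be illustrated with a minimal sketch. This is not the authors' implementation: it covers only the data-splitting step, not the diffusion model or the GNN encoder, and the function name `split_observed`, the `mask_ratio` parameter, and the use of `None` for missing entries are all assumptions made for illustration.

```python
import random


def split_observed(values, mask_ratio=0.5, seed=0):
    """Randomly hide a fraction of the observed entries (hypothetical helper).

    `values` is a list in which None marks an entry that is already missing.
    Returns (condition, target):
      condition: a copy of `values` with the hidden entries set to None
      target:    dict mapping each hidden index to its true value
    """
    rng = random.Random(seed)
    # Only entries that were actually measured can be masked for training.
    observed = [i for i, v in enumerate(values) if v is not None]
    n_mask = int(len(observed) * mask_ratio)
    masked = set(rng.sample(observed, n_mask))
    condition = [None if i in masked else v for i, v in enumerate(values)]
    target = {i: values[i] for i in masked}
    return condition, target


if __name__ == "__main__":
    expression = [1.0, None, 2.0, 3.0, 4.0]  # toy values, not real data
    condition, target = split_observed(expression, mask_ratio=0.5, seed=1)
    print("condition:", condition)
    print("target:", target)
```

In a training loop, a denoising network would be asked to reconstruct `target` given `condition`; the true missing entries (here index 1) are never used as supervision.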
Single-cell RNA sequencing (scRNA-seq) has revolutionized our ability to study individual cellular distinctions and uncover unique cell characteristics. However, a significant technical challenge in scRNA-seq analysis is the occurrence of "dropout" events, where the expression of certain genes cannot be detected. This issue is particularly pronounced in genes with low or sparse expression levels, impacting the precision and interpretability of the obtained data. To address this challenge, various imputation methods have been implemented to predict such missing values, aiming to enhance the accuracy and usefulness of the analysis. A prevailing hypothesis posits that scRNA-seq data conforms to a zero-inflated negative binomial (ZINB) distribution. Consequently, methods have been developed to model the data according to this distribution. Recent trends in scRNA-seq analysis have seen the emergence of deep learning approaches. Some techniques, such as the variational autoencoder, incorporate the ZINB distribution as a model loss function. Graph-based methods like Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT) have also gained attention as deep learning methodologies for scRNA-seq analysis. This study introduces scVGAE, an innovative approach integrating GCN into a variational autoencoder framework while utilizing a ZINB loss function. This integration presents a promising avenue for effectively addressing dropout events in scRNA-seq data, thereby enhancing the accuracy and reliability of downstream analyses. scVGAE outperforms other methods in cell clustering, with the best performance in 11 out of 14 datasets. An ablation study shows that all components of scVGAE are necessary. scVGAE is implemented in Python and downloadable at https://github.com/inoue0426/scVGAE.
{"title":"scVGAE: A Novel Approach using ZINB-Based Variational Graph Autoencoder for Single-Cell RNA-Seq Imputation","authors":"Yoshitaka Inoue","doi":"arxiv-2403.08959","DOIUrl":"https://doi.org/arxiv-2403.08959","journal":{"name":"arXiv - QuanBio - Genomics","volume":"31 1"},"publicationDate":"2024-03-13","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140154982"}
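Since the scVGAE abstract relies on the zero-inflated negative binomial (ZINB) distribution as a loss, a small sketch of the ZINB log-probability may help. It assumes the common mean/dispersion parameterization (mean `mu`, dispersion `theta`, zero-inflation probability `pi`); the abstract does not state which parameterization scVGAE uses, and the function names here are hypothetical, not taken from the scVGAE repository.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu > 0
    and dispersion theta > 0 (assumed parameterization)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under a ZINB with 0 <= pi < 1.

    A zero can come either from the extra point mass (probability pi)
    or from the negative binomial component itself.
    """
    nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb


def zinb_loss(counts, mu, theta, pi):
    """Negative log-likelihood summed over a list of counts."""
    return -sum(zinb_log_pmf(x, mu, theta, pi) for x in counts)
```

A model trained with such a loss would predict `mu`, `theta`, and `pi` per gene (or per gene and cell) and minimize `zinb_loss` against the observed counts.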
Generative models for multimodal data permit the identification of latent factors that may be associated with important determinants of observed data heterogeneity. Common or shared factors could be important for explaining variation across modalities, whereas other factors may be private and important only for the explanation of a single modality. Multimodal Variational Autoencoders, such as MVAE and MMVAE, are a natural choice for inferring those underlying latent factors and separating shared variation from private variation. In this work, we investigate their capability to reliably perform this disentanglement. In particular, we highlight a challenging problem setting where modality-specific variation dominates the shared signal. Taking a cross-modal prediction perspective, we demonstrate limitations of existing models and propose a modification that makes them more robust to modality-specific variation. Our findings are supported by experiments on synthetic as well as various real-world multi-omics data sets.
{"title":"Disentangling shared and private latent factors in multimodal Variational Autoencoders","authors":"Kaspar Märtens, Christopher Yau","doi":"arxiv-2403.06338","DOIUrl":"https://doi.org/arxiv-2403.06338","journal":{"name":"arXiv - QuanBio - Genomics","volume":"2 1"},"publicationDate":"2024-03-10","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140106856"}
The revolutionary progress in the development of next-generation sequencing (NGS) technologies has made it possible to deliver accurate genomic information in a timely manner. Over the past several years, NGS has transformed biomedical and clinical research and found application in the field of personalized medicine. Here we discuss the rise of personalized medicine and the history of NGS. We discuss current applications and uses of NGS in medicine, including infectious diseases, oncology, genomic medicine, and dermatology. We provide a brief discussion of selected studies where NGS was used to address a wide variety of questions in biomedical research and clinical medicine. Finally, we discuss the challenges of implementing NGS in routine clinical use.
{"title":"The use of next-generation sequencing in personalized medicine","authors":"Liya Popova, Valerie J. Carabetta","doi":"arxiv-2403.03688","DOIUrl":"https://doi.org/arxiv-2403.03688","journal":{"name":"arXiv - QuanBio - Genomics","volume":"2014 1"},"publicationDate":"2024-03-06","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140071725"}
Mohammad Rostami, Amin Ghariyazi, Hamed Dashti, Mohammad Hossein Rohban, Hamid R. Rabiee
Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) is a gene editing technology that has revolutionized the fields of biology and medicine. However, one of the challenges of using CRISPR is predicting the on-target efficacy and off-target sensitivity of single-guide RNAs (sgRNAs). This is because most existing methods are trained on separate datasets with different genes and cells, which limits their generalizability. In this paper, we propose a novel ensemble learning method for sgRNA design that is accurate and generalizable. Our method combines the predictions of multiple machine learning models to produce a single, more robust prediction. This approach allows us to learn from a wider range of data, which improves the generalizability of our model. We evaluated our method on a benchmark dataset of sgRNA designs and found that it outperformed existing methods in terms of both accuracy and generalizability. Our results suggest that our method can be used to design sgRNAs with high sensitivity and specificity, even for new genes or cells. This could have important implications for the clinical use of CRISPR, as it would allow researchers to design more effective and safer treatments for a variety of diseases.
{"title":"CRISPR: Ensemble Model","authors":"Mohammad Rostami, Amin Ghariyazi, Hamed Dashti, Mohammad Hossein Rohban, Hamid R. Rabiee","doi":"arxiv-2403.03018","DOIUrl":"https://doi.org/arxiv-2403.03018","journal":{"name":"arXiv - QuanBio - Genomics","volume":"271 1"},"publicationDate":"2024-03-05","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140045146"}
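The CRISPR ensemble abstract says that predictions from multiple models are combined into a single prediction but does not say how. The sketch below shows one simple possibility, a weighted average of per-model scores. The averaging rule, the function name `ensemble_predict`, and the toy scoring functions are assumptions for illustration only and should not be read as the method used in the paper.

```python
def ensemble_predict(models, guide, weights=None):
    """Combine the scores of several models for one sgRNA sequence.

    models:  list of callables, each mapping a guide sequence to a float score
    weights: optional list of non-negative weights (defaults to equal weights)
    """
    if not models:
        raise ValueError("at least one model is required")
    if weights is None:
        weights = [1.0] * len(models)
    if len(weights) != len(models):
        raise ValueError("weights and models must have the same length")
    total = sum(weights)
    return sum(w * model(guide) for w, model in zip(weights, models)) / total


# Toy stand-ins for trained predictors (hypothetical, not real efficacy models).
def gc_fraction(guide):
    return sum(base in "GC" for base in guide) / len(guide)


def constant_half(guide):
    return 0.5


if __name__ == "__main__":
    print(ensemble_predict([gc_fraction, constant_half], "GACGTTAGCA"))
```

Unequal weights could, for example, reflect each model's validation performance on data closest to the target gene or cell type.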
Systematic characterization of the biological effects of genetic perturbations is essential to the application of molecular biology and biomedicine. However, exhaustively testing genetic perturbations experimentally on a genome-wide scale is challenging. Here, we present TranscriptionNet, a deep learning model that integrates multiple biological networks to systematically predict transcriptional profiles for three types of genetic perturbations, RNA interference (RNAi), clustered regularly interspaced short palindromic repeat (CRISPR) and overexpression (OE), based on transcriptional profiles induced by genetic perturbations in the L1000 project. TranscriptionNet performs better than existing approaches in predicting inducible gene expression changes for all three types of genetic perturbations. TranscriptionNet can predict transcriptional profiles for all genes in existing biological networks and increases the number of genes with perturbational gene expression changes for each type of genetic perturbation from a few thousand to 26,945. TranscriptionNet demonstrates strong generalization ability when comparing predicted and true gene expression changes on different external tasks. Overall, TranscriptionNet can systematically predict transcriptional consequences induced by perturbing genes on a genome-wide scale and thus holds promise to systematically detect gene function and enhance drug development and target discovery.
{"title":"A genome-scale deep learning model to predict gene expression changes of genetic perturbations from multiplex biological networks","authors":"Lingmin Zhan, Yuanyuan Zhang, Yingdong Wang, Aoyi Wang, Caiping Cheng, Jinzhong Zhao, Wuxia Zhang, Peng Lia, Jianxin Chen","doi":"arxiv-2403.02724","DOIUrl":"https://doi.org/arxiv-2403.02724","journal":{"name":"arXiv - QuanBio - Genomics","volume":"55 1"},"publicationDate":"2024-03-05","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140045368"}
Brydon P. G. Wall, My Nguyen, J. Chuck Harrell, Mikhail G. Dozmorov
Three-dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, Topologically Associating Domains (TADs), and A/B compartments, play critical roles in a wide range of cellular processes by regulating gene expression. The recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even in single cells. However, current catalogs of 3D structures remain incomplete and unreliable due to differences in technology, tools, and low data resolution. Machine learning methods have emerged as an alternative to obtain missing 3D interactions and/or improve resolution. Such methods frequently use genome annotation data (ChIP-seq, DNase-seq, etc.), DNA sequencing information (k-mers, Transcription Factor Binding Site (TFBS) motifs), and other genomic properties to learn the associations between genomic features and chromatin interactions. In this review, we discuss computational tools for predicting three types of 3D interactions (EPIs, chromatin interactions, TAD boundaries) and analyze their pros and cons. We also point out obstacles to the computational prediction of 3D interactions and suggest future research directions.
{"title":"Machine and deep learning methods for predicting 3D genome organization","authors":"Brydon P. G. Wall, My Nguyen, J. Chuck Harrell, Mikhail G. Dozmorov","doi":"arxiv-2403.03231","DOIUrl":"https://doi.org/arxiv-2403.03231","journal":{"name":"arXiv - QuanBio - Genomics","volume":"19 1"},"publicationDate":"2024-03-04","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140076552"}
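The 3D genome review abstract lists k-mers among the sequence-derived features used by prediction methods. As a minimal illustration of what such a feature looks like, the sketch below counts overlapping k-mers in a DNA string. The function name `kmer_counts` is hypothetical, and real tools typically add steps not shown here (handling of ambiguous bases, reverse complements, normalization).

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in a DNA sequence.

    Returns an empty Counter when the sequence is shorter than k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    # Toy sequence, not taken from any dataset.
    print(kmer_counts("ACGTAC", 2))
```

A feature vector for a genomic region can then be built by reading the counts for a fixed, ordered list of all 4^k possible k-mers.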
Harihara Subrahmaniam Muralidharan, Jacquelyn S Michaelis, Jay Ghurye, Todd Treangen, Sergey Koren, Marcus Fedarko, Mihai Pop
Sequence differences between the strains of bacteria comprising host-associated and environmental microbiota may play a role in community assembly and influence the resilience of microbial communities to disturbances. Tools for characterizing strain-level variation within microbial communities, however, are limited in scope, focusing on just single nucleotide polymorphisms, or relying on reference-based analyses that miss complex functional and structural variants. Here, we demonstrate the power of assembly graph analysis to detect and characterize structural variants in almost 1,000 metagenomes generated as part of the Human Microbiome Project. We identify over nine million variants comprising insertion/deletion events, repeat copy-number changes, and mobile elements such as plasmids. We highlight some of the potential functional roles of these genomic changes. Our analysis revealed striking differences in the rate of variation across body sites, highlighting niche-specific mechanisms of bacterial adaptation. The structural variants we detect also include potentially novel prophage integration events, highlighting the potential use of graph-based analyses for phage discovery.
{"title":"Graph-based variant discovery reveals novel dynamics in the human microbiome","authors":"Harihara Subrahmaniam Muralidharan, Jacquelyn S Michaelis, Jay Ghurye, Todd Treangen, Sergey Koren, Marcus Fedarko, Mihai Pop","doi":"arxiv-2403.01610","DOIUrl":"https://doi.org/arxiv-2403.01610","journal":{"name":"arXiv - QuanBio - Genomics","volume":"15 1"},"publicationDate":"2024-03-03","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140036030"}
Jason Z. Kim, Nicolas Perrin-Gilbert, Erkan Narmanli, Paul Klein, Christopher R. Myers, Itai Cohen, Joshua J. Waterfall, James P. Sethna
Natural systems with emergent behaviors often organize along low-dimensional subsets of high-dimensional spaces. For example, despite the tens of thousands of genes in the human genome, the principled study of genomics is fruitful because biological processes rely on coordinated organization that results in lower-dimensional phenotypes. To uncover this organization, many nonlinear dimensionality reduction techniques have successfully embedded high-dimensional data into low-dimensional spaces by preserving local similarities between data points. However, the nonlinearities in these methods allow for too much curvature to preserve general trends across multiple non-neighboring data clusters, thereby limiting their interpretability and generalizability to out-of-distribution data. Here, we address both of these limitations by regularizing the curvature of manifolds generated by variational autoencoders, a process we call "$\Gamma$-VAE". We demonstrate its utility using two example data sets: bulk RNA-seq from The Cancer Genome Atlas (TCGA) and the Genotype Tissue Expression (GTEx) project; and single-cell RNA-seq from a lineage tracing experiment in hematopoietic stem cell differentiation. We find that the resulting regularized manifolds identify mesoscale structure associated with different cancer cell types, and accurately re-embed tissues from completely unseen, out-of-distribution cancers as if they were originally trained on them. Finally, we show that preserving long-range relationships to differentiated cells separates undifferentiated cells, which have not yet specialized, according to their eventual fate. Broadly, we anticipate that regularizing the curvature of generative models will enable more consistent, predictive, and generalizable models in any high-dimensional system with emergent low-dimensional behavior.
{"title":"$Γ$-VAE: Curvature regularized variational autoencoders for uncovering emergent low dimensional geometric structure in high dimensional data","authors":"Jason Z. Kim, Nicolas Perrin-Gilbert, Erkan Narmanli, Paul Klein, Christopher R. Myers, Itai Cohen, Joshua J. Waterfall, James P. Sethna","doi":"arxiv-2403.01078","DOIUrl":"https://doi.org/arxiv-2403.01078","journal":{"name":"arXiv - QuanBio - Genomics","volume":"32 1"},"publicationDate":"2024-03-02","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140035804"}
Abanti Bhattacharjya, Md Manowarul Islam, Md Ashraf Uddin, Md. Alamin Talukder, AKM Azad, Sunil Aryal, Bikash Kumar Paul, Wahia Tasnim, Muhammad Ali Abdulllah Almoyad, Mohammad Ali Moni
With the advent of information technology, bioinformatics is becoming increasingly attractive to researchers and academics. The recent development of various bioinformatics toolkits has enabled rapid processing and analysis of vast quantities of biological data in forms that humans can interpret. Most studies focus on identifying two associated diseases and analyzing them to construct gene regulatory interaction networks, a precursor to drug design. For instance, hypopharyngeal cancer is associated with EGFR-mutated lung adenocarcinoma. In this study, we select EGFR-mutated lung adenocarcinoma and hypopharyngeal cancer based on the occurrence of lung metastases in hypopharyngeal cancer. To conduct this study, we collect microarray datasets from GEO (Gene Expression Omnibus), an online database maintained by NCBI. We detect differentially expressed genes, genes common to both diseases, and hub genes for use in the subsequent analysis. Based on the 10 hub genes with the highest interactions according to the degree topology method and maximum clique centrality (MCC), our findings suggest common therapeutic molecules for the selected diseases. Our suggested therapeutic molecules will be useful for patients who have both diseases simultaneously.
{"title":"Exploring Gene Regulatory Interaction Networks and predicting therapeutic molecules for Hypopharyngeal Cancer and EGFR-mutated lung adenocarcinoma","authors":"Abanti Bhattacharjya, Md Manowarul Islam, Md Ashraf Uddin, Md. Alamin Talukder, AKM Azad, Sunil Aryal, Bikash Kumar Paul, Wahia Tasnim, Muhammad Ali Abdulllah Almoyad, Mohammad Ali Moni","doi":"arxiv-2402.17807","DOIUrl":"https://doi.org/arxiv-2402.17807","url":null,"abstract":"With the advent of information technology, bioinformatics is becoming increasingly attractive to researchers and academics. The recent development of various bioinformatics toolkits has enabled rapid processing and analysis of vast quantities of biological data in forms that humans can interpret. Most studies focus on identifying two associated diseases and analyzing them to construct gene regulatory interaction networks, a precursor to drug design. For instance, hypopharyngeal cancer is associated with EGFR-mutated lung adenocarcinoma. In this study, we select EGFR-mutated lung adenocarcinoma and hypopharyngeal cancer based on the occurrence of lung metastases in hypopharyngeal cancer. To conduct this study, we collect microarray datasets from GEO (Gene Expression Omnibus), an online database maintained by NCBI. We detect differentially expressed genes, genes common to both diseases, and hub genes for use in the subsequent analysis. Based on the 10 hub genes with the highest interactions according to the degree topology method and maximum clique centrality (MCC), our findings suggest common therapeutic molecules for the selected diseases. 
Our suggested therapeutic molecules will be useful for patients who have both diseases simultaneously.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"73 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140004060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}