Suyuan Zhao, Jiahuan Zhang, Yizhen Luo, Yushuai Wu, Zaiqing Nie
Cell identity encompasses various semantic aspects of a cell, including cell type, pathway information, disease information, and more, which are essential for biologists to gain insights into its biological characteristics. Understanding cell identity from transcriptomic data, such as annotating cell types, has become an important task in bioinformatics. As these semantic aspects are determined by human experts, it is impossible for AI models to effectively carry out cell identity understanding tasks without the supervision signals provided by single-cell and label pairs. The single-cell pre-trained language models (PLMs) currently used for this task are trained on only a single modality, transcriptomics data, and lack an understanding of cell identity knowledge. As a result, they have to be fine-tuned for downstream tasks and struggle when labeled data with the desired semantic labels are lacking. To address this issue, we propose an innovative solution: constructing a unified representation of single-cell data and natural language during the pre-training phase, allowing the model to directly incorporate insights related to cell identity. More specifically, we introduce LangCell, the first Language-Cell pre-training framework. LangCell utilizes texts enriched with cell identity information to gain a profound comprehension of cross-modal knowledge. Results from experiments conducted on different benchmarks show that LangCell is the only single-cell PLM that can work effectively in zero-shot cell identity understanding scenarios, and that it also significantly outperforms existing models in few-shot and fine-tuning cell identity understanding scenarios.
LangCell: Language-Cell Pre-training for Cell Identity Understanding. arXiv - QuanBio - Genomics, 2024-05-09. DOI: arxiv-2405.06708 (https://doi.org/arxiv-2405.06708).
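The abstract does not say how zero-shot prediction is carried out. A common way to use a shared cell-text embedding space, assumed here purely for illustration, is to compare a cell embedding against the embeddings of candidate label descriptions and pick the most similar one. The sketch below uses made-up three-dimensional vectors and hypothetical function names; it is not LangCell's code or API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_label(cell_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the cell embedding."""
    return max(
        label_embeddings,
        key=lambda name: cosine(cell_embedding, label_embeddings[name]),
    )


# Hypothetical embeddings; a real model would produce these from
# expression profiles and from label descriptions.
label_embeddings = {
    "T cell": [0.9, 0.1, 0.0],
    "B cell": [0.1, 0.9, 0.0],
    "monocyte": [0.0, 0.1, 0.9],
}
cell = [0.8, 0.2, 0.1]

print(zero_shot_label(cell, label_embeddings))  # T cell
```

No labeled cells are needed at prediction time in this setup, only a text description per candidate label, which is what makes the scenario zero-shot.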
Zhufeng Li, Sandeep S Cranganore, Nicholas Youngblut, Niki Kilbertus
Leveraging the vast genetic diversity within microbiomes offers unparalleled insights into complex phenotypes, yet the task of accurately predicting and understanding such traits from genomic data remains challenging. We propose a framework that takes advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences. Based on our model, we develop attribution techniques to elucidate gene interaction effects that drive microbial adaptation to diverse environments. We train and validate our approach on a large dataset of high-quality microbiome genomes from different habitats. We demonstrate not only solid predictive performance but also how sequence-level information of entire genomes allows us to identify gene associations underlying complex phenotypes. Our attribution recovers known important interaction networks and proposes new candidates for experimental follow-up.
Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity. arXiv - QuanBio - Genomics, 2024-05-09. DOI: arxiv-2405.05998 (https://doi.org/arxiv-2405.05998).
We investigate the information-theoretic conditions to achieve the complete reconstruction of a diploid genome. We also analyze the standard greedy and de-Bruijn graph-based algorithms and compare the coverage depth and read length requirements with the information-theoretic lower bound. Our results show that the gap between the two is considerable because both algorithms require the double repeats in the genome to be bridged.
On the Coverage Required for Diploid Genome Assembly. Daanish Mahajan, Chirag Jain, Navin Kashyap. arXiv - QuanBio - Genomics, 2024-05-09. DOI: arxiv-2405.05734 (https://doi.org/arxiv-2405.05734).
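For readers less familiar with the quantities being compared, coverage depth is conventionally the number of reads times the read length, divided by the genome length. The sketch below also uses the classical Lander-Waterman approximation for the expected uncovered fraction, which is brought in here as an assumption and is not the bound derived in the paper; the function names and numbers are illustrative only.

```python
import math


def coverage_depth(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads covering each base: c = N * L / G."""
    return num_reads * read_length / genome_length


def expected_uncovered_fraction(depth: float) -> float:
    """Lander-Waterman approximation: fraction of bases with no read is about e^(-c)."""
    return math.exp(-depth)


depth = coverage_depth(num_reads=300_000, read_length=100, genome_length=1_000_000)
print(depth)  # 30.0
print(expected_uncovered_fraction(depth))
```

Covering every base is only a necessary condition for assembly. The abstract's point is that the analyzed algorithms additionally need reads long enough to bridge double repeats, which this toy calculation does not model.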
Erin E. Gill, Baofeng Jia, Carmen Lia Murall, Raphaël Poujol, Muhammad Zohaib Anwar, Nithu Sara John, Justin Richardsson, Ashley Hobb, Abayomi S. Olabode, Alexandru Lepsa, Ana T. Duggan, Andrea D. Tyler, Arnaud N'Guessan, Atul Kachru, Brandon Chan, Catherine Yoshida, Christina K. Yung, David Bujold, Dusan Andric, Edmund Su, Emma J. Griffiths, Gary Van Domselaar, Gordon W. Jolly, Heather K. E. Ward, Henrich Feher, Jared Baker, Jared T. Simpson, Jaser Uddin, Jiannis Ragoussis, Jon Eubank, Jörg H. Fritz, José Héctor Gálvez, Karen Fang, Kim Cullion, Leonardo Rivera, Linda Xiang, Matthew A. Croxen, Mitchell Shiell, Natalie Prystajecky, Pierre-Olivier Quirion, Rosita Bajari, Samantha Rich, Samira Mubareka, Sandrine Moreira, Scott Cain, Steven G. Sutcliffe, Susanne A. Kraemer, Yann Joly, Yelizar Alturmessov, CPHLN consortium, CanCOGeN consortium, VirusSeq Data Portal Academic, Health network, Marc Fiume, Terrance P. Snutch, Cindy Bell, Catalina Lopez-Correa, Julie G. Hussin, Jeffrey B. Joy, Caroline Colijn, Paul M. K. Gordon, William W. L. Hsiao, Art F. Y. Poon, Natalie C. Knox, Mélanie Courtot, Lincoln Stein, Sarah P. Otto, Guillaume Bourque, B. Jesse Shapiro, Fiona S. L. Brinkman
The COVID-19 pandemic led to a large global effort to sequence SARS-CoV-2 genomes from patient samples to track viral evolution and inform public health response. Millions of SARS-CoV-2 genome sequences have been deposited in global public repositories. The Canadian COVID-19 Genomics Network (CanCOGeN - VirusSeq), a consortium tasked with coordinating expanded sequencing of SARS-CoV-2 genomes across Canada early in the pandemic, created the Canadian VirusSeq Data Portal, with associated data pipelines and procedures, to support these efforts. The goal of VirusSeq was to allow open access to Canadian SARS-CoV-2 genomic sequences and enhanced, standardized contextual data that were unavailable in other repositories and that meet FAIR standards (Findable, Accessible, Interoperable and Reusable). The Portal data submission pipeline contains data quality checking procedures and appropriate acknowledgement of data generators that encourages collaboration. Here we also highlight Duotang, a web platform that presents genomic epidemiology and modeling analyses on circulating and emerging SARS-CoV-2 variants in Canada. Duotang presents dynamic changes in variant composition of SARS-CoV-2 in Canada and by province, estimates variant growth, and displays complementary interactive visualizations, with a text overview of the current situation. The VirusSeq Data Portal and Duotang resources, alongside additional analyses and resources computed from the Portal (COVID-MVP, CoVizu), are all open-source and freely available. Together, they provide an updated picture of SARS-CoV-2 evolution to spur scientific discussions, inform public discourse, and support communication with and within public health authorities. They also serve as a framework for other jurisdictions interested in open, collaborative sequence data sharing and analyses.
The Canadian VirusSeq Data Portal & Duotang: open resources for SARS-CoV-2 viral sequences and genomic epidemiology. arXiv - QuanBio - Genomics, 2024-05-08. DOI: arxiv-2405.04734 (https://doi.org/arxiv-2405.04734).
Andac Demir, Elizaveta Solovyeva, James Boylan, Mei Xiao, Fabrizio Serluca, Sebastian Hoersch, Jeremy Jenkins, Murthy Devarakonda, Bulent Kiziltan
Influenced by breakthroughs in LLMs, single-cell foundation models are emerging. While these models show successful performance in cell type clustering, phenotype classification, and gene perturbation response prediction, it remains to be seen if a simpler model could achieve comparable or better results, especially with limited data. This is important, as the quantity and quality of single-cell data typically fall short of the standards of the textual data used for training LLMs. Single-cell sequencing often suffers from technical artifacts, dropout events, and batch effects. These challenges are compounded in a weakly supervised setting, where the labels of cell states can be noisy, further complicating the analysis. To tackle these challenges, we present sc-OTGM, which has fewer than 500K parameters, making it approximately 100x more compact than the foundation models and an efficient alternative. sc-OTGM is an unsupervised model grounded in the inductive bias that scRNA-seq data can be generated from a finite mixture of multivariate Gaussian distributions. The core function of sc-OTGM is to create a probabilistic latent space that uses a GMM as its prior distribution and to distinguish between distinct cell populations by learning their respective marginal PDFs. It uses a Hit-and-Run Markov chain sampler to determine the OT plan across these PDFs within the GMM framework. We evaluated our model on a CRISPR-mediated perturbation dataset, called CROP-seq, consisting of 57 one-gene perturbations. Our results demonstrate that sc-OTGM is effective in cell state classification, aids in the analysis of differential gene expression, and ranks genes for target identification through a recommender system. It also predicts the effects of single-gene perturbations on downstream gene regulation and generates synthetic scRNA-seq data conditioned on specific cell states.
sc-OTGM: Single-Cell Perturbation Modeling by Solving Optimal Mass Transport on the Manifold of Gaussian Mixtures. arXiv - QuanBio - Genomics, 2024-05-06. DOI: arxiv-2405.03726 (https://doi.org/arxiv-2405.03726).
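To make the Gaussian-mixture idea concrete, the following toy sketch computes the posterior probability ("responsibility") that a point belongs to each component of a one-dimensional mixture. It is a minimal illustration under stated assumptions: the mixture weights, means, and standard deviations are made up, the latent space is one-dimensional, and neither the optimal transport step nor the Hit-and-Run sampler is included. It is not the sc-OTGM implementation.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(x, weights, means, stds):
    """Posterior probability of each mixture component given the point x."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


# Two hypothetical cell populations with equal prior weight.
weights, means, stds = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]

print(responsibilities(0.0, weights, means, stds))  # [0.5, 0.5] by symmetry
print(responsibilities(2.0, weights, means, stds))  # favors the second component
```

Assigning each cell to the component with the largest responsibility gives a simple cell-state classifier of the kind the abstract alludes to.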
In this paper, a multi-domain multi-task algorithm for feature selection in bulk RNA-seq data is proposed. Two datasets arising from the mouse host immune response to Salmonella infection are investigated. Data are collected from several strains of Collaborative Cross mice. Samples from the spleen and liver serve as the two domains. Several machine learning experiments are conducted, and in each case a small subset of features that are discriminative across domains is extracted. The algorithm proves viable and underlines the benefits of cross-domain feature selection by extracting a new subset of discriminative features that could not be extracted by a single-domain approach alone.
A Multi-Domain Multi-Task Approach for Feature Selection from Bulk RNA Datasets. Karim Salta, Tomojit Ghosh, Michael Kirby. arXiv - QuanBio - Genomics, 2024-05-04. DOI: arxiv-2405.02534 (https://doi.org/arxiv-2405.02534).
Matheus Henrique Pimenta-Zanon, André Yoshiaki Kashiwabara, André Luís Laforga Vanzela, Fabricio Martins Lopes
Advances in high-throughput sequencing technologies provide a large number of genomes to be analyzed, so computational methodologies play a crucial role in analyzing and extracting knowledge from the data generated. Investigating genomic mutations is critical because of their impact on chromosomal evolution, genetic disorders, and diseases. Sequence alignment is commonly used to analyze genomic variation; however, this approach can be computationally expensive and potentially arbitrary in scenarios with large datasets. Here, we present a novel method for identifying single nucleotide polymorphisms (SNPs) in DNA sequences from assembled genomes. This method uses the principle of maximum entropy to select the most informative k-mers specific to the variant under investigation. The use of this informative k-mer set enables the detection of variant-specific mutations in comparison to a reference sequence. In addition, our method offers the possibility of classifying novel sequences with no need for organism-specific information. GRAMEP demonstrated high accuracy in both in silico simulations and analyses of real viral genomes, including Dengue, HIV, and SARS-CoV-2. Our approach maintained accurate SARS-CoV-2 variant identification while demonstrating a lower computational cost compared to the gold-standard statistical tools. The source code for this proof-of-concept implementation is freely available at https://github.com/omatheuspimenta/GRAMEP.
Identification of SNPs in genomes using GRAMEP, an alignment-free method based on the Principle of Maximum Entropy. arXiv - QuanBio - Genomics, 2024-05-02. DOI: arxiv-2405.01715 (https://doi.org/arxiv-2405.01715).
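The alignment-free idea can be illustrated with a much simpler stand-in: count the k-mers of a reference and of a variant sequence, and keep the k-mers that occur only in the variant. The sketch below does just that on two made-up ten-base sequences. It deliberately omits the maximum-entropy selection step that GRAMEP uses to pick informative k-mers, and the function names are hypothetical rather than taken from the GRAMEP repository.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def variant_specific_kmers(reference: str, variant: str, k: int) -> set:
    """k-mers that occur in the variant but never in the reference."""
    ref_counts = kmer_counts(reference, k)
    var_counts = kmer_counts(variant, k)
    return {kmer for kmer in var_counts if kmer not in ref_counts}


reference = "ACGTACGTGA"
variant = "ACGTACCTGA"  # single substitution G -> C at 0-based position 6

print(sorted(variant_specific_kmers(reference, variant, k=3)))
# ['ACC', 'CCT', 'CTG']
```

A single substitution changes at most k overlapping k-mers, so the three variant-only 3-mers here all span the mutated position, which is how such a set can point back to the location of a SNP without an alignment.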
Jayoung Ryu, Romain Lopez, Charlotte Bunne, Aviv Regev
It is now possible to conduct large-scale perturbation screens with complex readout modalities, such as different molecular profiles or high-content cell images. While these open the way for systematic dissection of causal cell circuits, integrating such data across screens to maximize our ability to predict circuits poses substantial computational challenges, which have not been addressed. Here, we extend two Gromov-Wasserstein Optimal Transport methods to incorporate the perturbation label for cross-modality alignment. The obtained alignment is then employed to train a predictive model that estimates cellular responses to perturbations observed with only one measurement modality. We validate our method for the tasks of cross-modality alignment and cross-modality prediction in a recent multi-modal single-cell perturbation dataset. Our approach opens the way to unified causal models of cell biology.
Cross-modality Matching and Prediction of Perturbation Responses with Labeled Gromov-Wasserstein Optimal Transport. arXiv - QuanBio - Genomics, 2024-05-01. DOI: arxiv-2405.00838 (https://doi.org/arxiv-2405.00838).
Caroline C. McGrouther, Aaditya V. Rangan, Arianna Di Florio, Jeremy A. Elman, Nicholas J. Schork, John Kelsoe
Bipolar disorder (BD) is a highly heritable brain disorder that affects an estimated 50 million people worldwide. Due to recent advances in genotyping technology and bioinformatics methodology, as well as the increase in the overall amount of available data, our understanding of the genetic underpinnings of BD has improved. A growing consensus is that BD is polygenic and heterogeneous, but the specifics of that heterogeneity are not yet well understood. Here we use a recently developed technique to investigate the genetic heterogeneity of bipolar disorder. We find strong statistical evidence for a 'bicluster': a subset of bipolar subjects that exhibits a disease-specific genetic pattern. The structure illuminated by this bicluster replicates in several other datasets and can be used to improve BD risk-prediction algorithms. We believe that this bicluster is likely to correspond to a genetically distinct subtype of BD. More generally, we believe that our biclustering approach is a promising means of untangling the underlying heterogeneity of complex disease without the need for reliable subphenotypic data.
Heterogeneity analysis provides evidence for a genetically homogeneous subtype of bipolar-disorder. arXiv - QuanBio - Genomics, 2024-04-30. DOI: arxiv-2405.00159 (https://doi.org/arxiv-2405.00159).
Thomas Gaudelet, Alice Del Vecchio, Eli M Carrami, Juliana Cudini, Chantriolnt-Andreas Kapourani, Caroline Uhler, Lindsay Edwards
Interventions play a pivotal role in the study of complex biological systems. In drug discovery, genetic interventions (such as CRISPR base editing) have become central to both identifying potential therapeutic targets and understanding a drug's mechanism of action. With the advancement of CRISPR and the proliferation of genome-scale analyses such as transcriptomics, a new challenge is to navigate the vast combinatorial space of concurrent genetic interventions. Addressing this, our work concentrates on estimating the effects of pairwise genetic combinations on the cellular transcriptome. We introduce two novel contributions: Salt, a biologically-inspired baseline that posits the mostly additive nature of combination effects, and Peper, a deep learning model that extends Salt's additive assumption to achieve unprecedented accuracy. Our comprehensive comparison against existing state-of-the-art methods, grounded in diverse metrics, and our out-of-distribution analysis highlight the limitations of current models in realistic settings. This analysis underscores the necessity for improved modelling techniques and data acquisition strategies, paving the way for more effective exploration of genetic intervention effects.
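The additive assumption that the abstract attributes to Salt can be illustrated with a small sketch. This is not the authors' implementation: the function name, the use of mean expression profiles per condition, and the toy values are all assumptions made for illustration. It predicts the expression profile of a pairwise perturbation by adding each single perturbation's shift from the control profile.

```python
from typing import List


def additive_combination_effect(
    control: List[float],
    single_a: List[float],
    single_b: List[float],
) -> List[float]:
    """Predict expression under the pair (a, b), assuming effects add.

    Each argument is a mean expression profile over the same genes
    (hypothetical inputs; the paper's actual data format is not stated here).
    """
    if not (len(control) == len(single_a) == len(single_b)):
        raise ValueError("all profiles must cover the same genes")
    # Shift of each single perturbation relative to control, summed onto control.
    return [c + (a - c) + (b - c) for c, a, b in zip(control, single_a, single_b)]


# Toy mean expression profiles over three hypothetical genes.
control = [1.0, 2.0, 3.0]
knock_a = [1.5, 2.0, 2.0]
knock_b = [1.0, 3.0, 3.5]

predicted_ab = additive_combination_effect(control, knock_a, knock_b)
print(predicted_ab)  # [1.5, 3.0, 2.5]
```

A baseline of this kind needs no training, which is why it serves as a useful reference point: a learned model is only worth its cost if it beats the plain sum of single-perturbation effects.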
{"title":"Season combinatorial intervention predictions with Salt & Peper","authors":"Thomas Gaudelet, Alice Del Vecchio, Eli M Carrami, Juliana Cudini, Chantriolnt-Andreas Kapourani, Caroline Uhler, Lindsay Edwards","doi":"arxiv-2404.16907","DOIUrl":"https://doi.org/arxiv-2404.16907","url":null,"abstract":"Interventions play a pivotal role in the study of complex biological systems. In drug discovery, genetic interventions (such as CRISPR base editing) have become central to both identifying potential therapeutic targets and understanding a drug's mechanism of action. With the advancement of CRISPR and the proliferation of genome-scale analyses such as transcriptomics, a new challenge is to navigate the vast combinatorial space of concurrent genetic interventions. Addressing this, our work concentrates on estimating the effects of pairwise genetic combinations on the cellular transcriptome. We introduce two novel contributions: Salt, a biologically-inspired baseline that posits the mostly additive nature of combination effects, and Peper, a deep learning model that extends Salt's additive assumption to achieve unprecedented accuracy. Our comprehensive comparison against existing state-of-the-art methods, grounded in diverse metrics, and our out-of-distribution analysis highlight the limitations of current models in realistic settings. This analysis underscores the necessity for improved modelling techniques and data acquisition strategies, paving the way for more effective exploration of genetic intervention effects.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140811947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}