Pub Date: 2024-11-23
DOI: 10.1093/bioinformatics/btae704
Cheng Peng, Jiayu Shang, Jiaojiao Guan, Donglin Wang, Yanni Sun
Motivation: Viruses, with their ubiquitous presence and high diversity, play pivotal roles in ecological systems and public health. Accurate identification of viruses in various ecosystems is essential for comprehending their variety and assessing their ecological influence. Metagenomic sequencing has become a major strategy to survey the viruses in various ecosystems. However, accurate and comprehensive virus detection in metagenomic data remains difficult. Limited reference sequences prevent alignment-based methods from identifying novel viruses. Machine learning-based tools are more promising in novel virus detection but often miss short viral contigs, which are abundant in typical metagenomic data. The inconsistency in virus search results produced by available tools further highlights the urgent need for a more robust tool for virus identification.
Results: In this work, we develop ViraLM for identifying novel viral contigs in metagenomic data. By employing the latest genome foundation model as the backbone and training on a rigorously constructed dataset, the model is able to distinguish viruses from other organisms based on the learned genomic characteristics. We thoroughly tested ViraLM on multiple datasets and the experimental results show that ViraLM outperforms available tools in different scenarios. In particular, ViraLM improves the F1-score on short contigs by 22%.
Availability: The source code of ViraLM is available via: https://github.com/ChengPENG-wolf/ViraLM.
Title: ViraLM: Empowering Virus Discovery through the Genome Foundation Model.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2024-11-23
DOI: 10.1093/bioinformatics/btae669
Rosario Astaburuaga-García, Thomas Sell, Samet Mutlu, Anja Sieber, Kirsten Lauber, Nils Blüthgen
Motivation: High-dimensional single-cell mass cytometry data are confounded by unwanted covariance due to variations in cell size and staining efficiency, making analysis and interpretation challenging.
Results: We present RUCova, a novel method designed to address confounding factors in mass cytometry data. RUCova removes unwanted covariance from measured markers by applying multivariate linear regression based on Surrogates of sources of Unwanted Covariance (SUCs) and principal component analysis (PCA). We exemplify the use of RUCova and show that it effectively removes unwanted covariance while preserving genuine biological signals. Our results demonstrate the efficacy of RUCova in elucidating complex data patterns, facilitating the identification of activated signalling pathways, and improving the classification of important cell populations such as apoptotic cells. By providing a robust framework for data normalization and interpretation, RUCova enhances the accuracy and reliability of mass cytometry analyses, contributing to advances in our understanding of cellular biology and disease mechanisms.
Availability and implementation: The R package is available on https://github.com/molsysbio/RUCova. Detailed documentation, data, and the code required to reproduce the results are available on https://doi.org/10.5281/zenodo.10913464.
Supplementary information: Available at Bioinformatics online (PDF).
Title: RUCova: Removal of Unwanted Covariance in mass cytometry data.
Journal: Bioinformatics (Oxford, England)
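As a rough illustration of the regress-out idea in the RUCova abstract, the sketch below fits an ordinary least-squares line of one marker on a single surrogate and keeps the residual variation. It uses only the Python standard library and is not the RUCova implementation, which is an R package and, per the abstract, uses multivariate regression on SUCs and PCA. The function name `regress_out`, the single-surrogate simplification, the choice to keep the marker mean, and the toy values are all assumptions made for illustration.

```python
def regress_out(marker, surrogate):
    """Remove the linear dependence of `marker` on one `surrogate`.

    Simplified stand-in (assumption): one predictor instead of the
    multivariate model described in the abstract.
    """
    if len(marker) != len(surrogate) or len(marker) < 2:
        raise ValueError("marker and surrogate need equal length >= 2")
    n = len(marker)
    mean_m = sum(marker) / n
    mean_s = sum(surrogate) / n
    var_s = sum((s - mean_s) ** 2 for s in surrogate)
    if var_s == 0:
        raise ValueError("surrogate is constant; nothing to regress on")
    # Least-squares slope of marker on surrogate.
    slope = sum((s - mean_s) * (m - mean_m)
                for s, m in zip(surrogate, marker)) / var_s
    # Subtract only the surrogate-driven part, so the marker mean is kept.
    return [m - slope * (s - mean_s) for m, s in zip(marker, surrogate)]


# Hypothetical toy data: a marker that scales with a cell-size surrogate.
cell_size = [1.0, 2.0, 3.0, 4.0, 5.0]
raw_marker = [2.3, 3.8, 6.1, 7.6, 10.2]
corrected_marker = regress_out(raw_marker, cell_size)
```

After correction, the sample covariance between `corrected_marker` and `cell_size` is zero up to floating-point error, which is the property a regress-out step is meant to provide.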
Pub Date: 2024-11-22
DOI: 10.1093/bioinformatics/btae676
Di Liu, Yina Wei
Summary: As brain imaging and neurofeedback technologies advance, the brain-to-brain interface (BBI) has emerged as an innovative field, enabling in-depth exploration of cross-brain information exchange and enhancing our understanding of collaborative intelligence. However, no open-source virtual reality (VR) platform currently supports the rapid and efficient configuration of multi-user, collaborative BBIs. To address this gap, we introduce the Collaborative Virtual Reality Brain-to-Brain Interface (CVR-BBI), an open-source platform consisting of a client and a server. The CVR-BBI client enables users to participate in collaborative experiments, collect electroencephalogram (EEG) data, and manage interactive multisensory stimuli within the VR environment. Meanwhile, the CVR-BBI server manages multi-user collaboration paradigms and performs real-time analysis of the EEG data. We evaluated the CVR-BBI platform using the steady-state visually evoked potential (SSVEP) paradigm and observed that collaborative decoding outperformed individual decoding, validating the platform's effectiveness in collaborative settings. CVR-BBI offers a pioneering platform that facilitates the development of innovative BBI applications within collaborative VR environments, thereby enhancing the understanding of brain collaboration and cognition.
Availability and implementation: CVR-BBI is released as an open-source platform, with its source code being available at https://github.com/DILIU1/CVR-BBI.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: CVR-BBI: An Open-Source VR Platform for Multi-User Collaborative Brain to Brain Interfaces.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2024-11-22
DOI: 10.1093/bioinformatics/btae620
Albert Garcia Lopez, Daniela Albrecht-Eckardt, Gianni Panagiotou, Sascha Schäuble
Summary: The ever-growing amount of genome-wide omics data has paved the way for solving life science problems in a data-driven manner. Enrichment analysis, among other methods, is part of the standard analysis arsenal for determining systemic signals in any given transcriptomic or proteomic data. Only some members of the fungal kingdom, however, can be analyzed via public web applications, despite the global rise of fungal pathogens and their increasing resistance to antimycotics. We present FungiFun3, a major update of our user-friendly gene set enrichment web application dedicated to fungi. FungiFun3 was rebuilt from scratch to provide a modern and easy-to-use web interface and supports more than four times as many fungal strains (n = 1,287 in total) as its predecessor. In addition, it allows ranked gene set enrichment analysis at the genomic scale. FungiFun3 thus serves as a starting hub for identifying molecular signals in omics data sets for a vast number of available fungal strains, including the human fungal pathogens on the WHO's priority list and far beyond.
Availability and implementation: FungiFun3, including sample data and FAQ, is freely available at https://fungifun3.hki-jena.de/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: FungiFun3: Systemic gene set enrichment analysis for fungal species.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2024-11-22
DOI: 10.1093/bioinformatics/btae621
Céline Marquet, Julius Schlensok, Marina Abakarova, Burkhard Rost, Elodie Laine
Motivation: Exhaustive experimental annotation of the effect of all known protein variants remains daunting and expensive, stressing the need for scalable effect predictions. We introduce VespaG, a blazingly fast missense amino acid variant effect predictor, leveraging protein Language Model (pLM) embeddings as input to a minimal deep learning model.
Results: To overcome the sparsity of experimental training data, we created a dataset of 39 million single amino acid variants from the human proteome, using the multiple sequence alignment-based effect predictor GEMME as a pseudo standard-of-truth. This setup increases interpretability compared to the baseline pLM and is easily retrainable with novel or updated pLMs. Assessed against the ProteinGym benchmark (217 multiplex assays of variant effect, MAVEs, with 2.5 million variants), VespaG achieved a mean Spearman correlation of 0.48 ± 0.02, matching top-performing methods evaluated on the same data. VespaG has the advantage of being orders of magnitude faster, predicting all mutational landscapes of all proteins in proteomes such as Homo sapiens or Drosophila melanogaster in under 30 minutes on a consumer laptop (12-core CPU, 16 GB RAM).
Availability: VespaG is available freely at https://github.com/jschlensok/vespag. The associated training data and predictions are available at https://doi.org/10.5281/zenodo.11085958.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Expert-guided protein Language Models enable accurate and blazingly fast fitness prediction.
Journal: Bioinformatics (Oxford, England)
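The headline metric in the VespaG abstract is a Spearman rank correlation averaged over assays. A minimal standard-library Python sketch of that computation follows. The function names and the toy scores are hypothetical, and the actual ProteinGym evaluation may differ in details such as tie handling or how assays are weighted.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_a * var_b) ** 0.5


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(predicted) != len(measured):
        raise ValueError("inputs must have equal length")
    return pearson(average_ranks(predicted), average_ranks(measured))


def mean_spearman(assays):
    """Unweighted mean over (predicted, measured) pairs, one pair per assay."""
    scores = [spearman(pred, meas) for pred, meas in assays]
    return sum(scores) / len(scores)


# Hypothetical toy assays: (predicted scores, measured scores).
toy_assays = [
    ([0.1, 0.4, 0.2, 0.9], [1.0, 3.0, 2.5, 4.0]),
    ([0.3, 0.1, 0.8, 0.5], [2.0, 2.5, 3.0, 1.0]),
]
toy_mean = mean_spearman(toy_assays)
```

Because only ranks enter the calculation, any monotone rescaling of the predicted scores leaves the result unchanged, which is why the metric suits predictors whose outputs are not calibrated to assay units.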
Pub Date: 2024-11-21
DOI: 10.1093/bioinformatics/btae699
Rakbin Sung, Hyeonkyu Kim, Junil Kim, Daewon Lee
Summary: TENET reconstructs gene regulatory networks from single-cell RNA sequencing (scRNAseq) data using transfer entropy, and works successfully on a variety of scRNAseq data. However, TENET is limited by its long computation time on large datasets. To address this limitation, we propose FastTENET, an array-computing version of the TENET algorithm optimized for acceleration on manycore processors such as GPUs. FastTENET counts the unique patterns of joint events to compute the transfer entropy using array computing. Compared to TENET, FastTENET achieves up to a 973× performance improvement.
Availability and implementation: FastTENET is available on GitHub at https://github.com/cxinsys/fasttenet.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: FastTENET: an accelerated TENET algorithm based on manycore computing in Python.
Journal: Bioinformatics (Oxford, England)
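The FastTENET abstract describes computing transfer entropy by counting unique patterns of joint events. The sketch below shows that counting idea for a single pair of discretized series with history length 1, in plain Python. It is not FastTENET's code or API: the function name, the history length, the plug-in probability estimates, and the toy series are assumptions, and the array-based GPU batching that the abstract credits for the speedup is not reproduced here.

```python
from collections import Counter
from math import log2


def transfer_entropy(source, target):
    """Plug-in transfer entropy source -> target in bits (history length 1).

    TE = sum over patterns of
         p(y_next, y_prev, x_prev)
         * log2( p(y_next | y_prev, x_prev) / p(y_next | y_prev) )
    """
    if len(source) != len(target) or len(target) < 2:
        raise ValueError("series need equal length >= 2")
    # One joint event per time step: (y_next, y_prev, x_prev).
    events = list(zip(target[1:], target[:-1], source[:-1]))
    n = len(events)
    joint = Counter(events)
    prev_pair = Counter((y_prev, x_prev) for _, y_prev, x_prev in events)
    target_pair = Counter((y_next, y_prev) for y_next, y_prev, _ in events)
    target_prev = Counter(y_prev for _, y_prev, _ in events)

    te = 0.0
    for (y_next, y_prev, x_prev), count in joint.items():
        p_joint = count / n
        p_given_both = count / prev_pair[(y_prev, x_prev)]
        p_given_target = target_pair[(y_next, y_prev)] / target_prev[y_prev]
        te += p_joint * log2(p_given_both / p_given_target)
    return te


# Hypothetical toy series: `follower` copies `driver` with a one-step lag.
driver = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1]
follower = [0] + driver[:-1]
te_forward = transfer_entropy(driver, follower)
te_reverse = transfer_entropy(follower, driver)
```

Only the distinct patterns in `joint` are visited in the final loop, so the cost of the summation depends on the number of unique joint events rather than on the number of time points. That is the property that makes a pattern-counting formulation a natural fit for array computing.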
Pub Date: 2024-11-21
DOI: 10.1093/bioinformatics/btae675
Yu-Xiang Huang, Rong Liu
Motivation: Post-translational modification (PTM) crosstalk events play critical roles in biological processes. Several machine learning methods have been developed to identify PTM crosstalk within proteins, but the accuracy is still far from satisfactory. Recent breakthroughs in deep learning and protein structure prediction could provide a potential solution to this issue.
Results: We propose DeepPCT, a deep learning algorithm to identify PTM crosstalk using AlphaFold2-based structures. In this algorithm, one deep learning classifier was constructed for sequence-based prediction by combining the residue and residue pair embeddings with cross-attention techniques, while the other classifier was established for structure-based prediction by integrating the structural embedding and a graph neural network. Meanwhile, a machine learning classifier was developed using novel structural descriptors and a random forest model to complement the structural deep learning classifier. By integrating the three classifiers, DeepPCT outperformed existing algorithms in different evaluation scenarios and showed better generalizability on new data owing to its weaker distance dependency.
Availability: Datasets, codes, and models of DeepPCT are freely accessible at https://github.com/hzau-liulab/DeepPCT/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Improved prediction of post-translational modification crosstalk within proteins using DeepPCT.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2024-11-21
DOI: 10.1093/bioinformatics/btae703
Da Peng, Patrick Cahan
Motivation: Computational modelling of cell state transitions has been of great interest in developmental biology, cancer biology and cell fate engineering because it enables perturbation experiments to be performed in silico more rapidly and cheaply than could be achieved in a lab. Recent advancements in single-cell RNA sequencing (scRNA-seq) allow the capture of high-resolution snapshots of cell states as they transition along temporal trajectories. Using these high-throughput datasets, we can train computational models to generate in silico 'synthetic' cells that faithfully mimic the temporal trajectories.
Results: Here we present OneSC, a platform that can simulate cell state transitions using systems of stochastic differential equations governed by a regulatory network of core transcription factors (TFs). Unlike many current network inference methods, OneSC prioritizes generating Boolean networks that produce faithful cell state transitions and terminal cell states that mimic real biological systems. Applying OneSC to real data, we inferred a core TF network using a mouse myeloid progenitor scRNA-seq dataset and showed that the dynamical simulations of that network generate synthetic single-cell expression profiles that faithfully recapitulate the four myeloid differentiation trajectories leading to differentiated cell states (erythrocytes, megakaryocytes, granulocytes and monocytes). Finally, through in silico perturbations of the mouse myeloid progenitor core network, we showed that OneSC can accurately predict cell fate decision biases of TF perturbations that closely match previous experimental observations.
Availability: OneSC is implemented as a Python package on GitHub (https://github.com/CahanLab/oneSC) and on Zenodo (https://zenodo.org/records/14052421).
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: OneSC: A computational platform for recapitulating cell state transitions.
Journal: Bioinformatics (Oxford, England)
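To make the idea of stochastic differential equations driven by a Boolean regulatory network concrete, here is a minimal Euler-Maruyama sketch in plain Python. The two-gene mutual-inhibition network, the 0.5 on/off threshold, the relaxation-toward-target drift, the clipping to [0, 1], and every name and parameter are assumptions for illustration. OneSC's actual equations and API are not given in the abstract and are not reproduced here.

```python
import random


def simulate(rules, initial, steps=2000, dt=0.01, noise=0.05, seed=0):
    """Euler-Maruyama simulation of dx = (target(x) - x) dt + noise dW.

    `rules` maps each gene to a function of the Boolean state that returns
    True when the gene should be driven towards full expression.
    """
    rng = random.Random(seed)
    x = dict(initial)
    trajectory = [dict(x)]
    for _ in range(steps):
        # Discretize expression into on/off for the Boolean rules.
        state = {gene: level > 0.5 for gene, level in x.items()}
        updated = {}
        for gene, rule in rules.items():
            target = 1.0 if rule(state) else 0.0
            drift = (target - x[gene]) * dt
            diffusion = noise * (dt ** 0.5) * rng.gauss(0.0, 1.0)
            # Keep expression inside [0, 1] (a modelling choice of this sketch).
            updated[gene] = min(1.0, max(0.0, x[gene] + drift + diffusion))
        x = updated
        trajectory.append(dict(x))
    return trajectory


# Hypothetical toy network: two TFs that repress each other.
toy_rules = {
    "A": lambda s: not s["B"],
    "B": lambda s: not s["A"],
}
toy_trajectory = simulate(toy_rules, {"A": 0.8, "B": 0.2})
final_state = toy_trajectory[-1]
```

In this toy setup, the cell that starts with A high settles into the A-on, B-off terminal state. An in silico perturbation can be mimicked by replacing a gene's rule with a constant, for example `lambda s: False` for a knockout, and rerunning the simulation.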
Pub Date: 2024-11-21
DOI: 10.1093/bioinformatics/btae693
Zhengchao Luo, Wei Wu, Qichen Sun, Jinzhuo Wang
Motivation: Accurate prediction of drug-target interactions (DTIs), especially for novel targets or drugs, is crucial for accelerating drug discovery. Recent advances in pretrained language models (PLMs) and multi-modal learning present new opportunities to enhance DTI prediction by leveraging vast unlabeled molecular data and integrating complementary information from multiple modalities.
Results: We introduce DrugLAMP (PLM-Assisted Multi-modal Prediction), a PLM-based multi-modal framework for accurate and transferable DTI prediction. DrugLAMP integrates molecular graph and protein sequence features extracted by PLMs and traditional feature extractors. We introduce two novel multi-modal fusion modules: (1) Pocket-guided Co-Attention (PGCA), which uses protein pocket information to guide the attention mechanism on drug features, and (2) Paired Multi-modal Attention (PMMA), which enables effective cross-modal interactions between drug and protein features. These modules work together to enhance the model's ability to capture complex drug-protein interactions. Moreover, the Contrastive Compound-Protein Pre-training (2C2P) module enhances the model's generalization to real-world scenarios by aligning features across modalities and conditions. Comprehensive experiments demonstrate DrugLAMP's state-of-the-art performance on both standard benchmarks and challenging settings simulating real-world drug discovery, where test drugs/targets are unseen during training. Visualizations of attention maps and application to predict cryptic pockets and drug side effects further showcase DrugLAMP's strong interpretability and generalizability. Ablation studies confirm the contributions of the proposed modules.
Availability: Source code and datasets are freely available at https://github.com/Lzcstan/DrugLAMP. All data originate from public sources.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Accurate and Transferable Drug-Target Interaction Prediction with DrugLAMP. Journal: Bioinformatics (Oxford, England).
Motivation: Phylogenetic reconstruction is a fundamental problem in computational biology. The Neighbor Joining (NJ) algorithm offers an efficient distance-based solution to this problem, which often serves as the foundation for more advanced statistical methods. Despite prior efforts to enhance the speed of NJ, the computation of the n^2 entries of the distance matrix, where n is the number of phylogenetic tree leaves, continues to pose a limitation in scaling NJ to larger datasets.
Results: In this work, we propose a new algorithm which does not require computing a dense distance matrix. Instead, it dynamically determines a sparse set of at most O(n log n) distance matrix entries to be computed in its basic version, and up to O(n log^2 n) entries in an enhanced version. We show by experiments that this approach reduces the execution time of NJ for large datasets, with a trade-off in accuracy.
Availability and implementation: Sparse Neighbor Joining is implemented in Python and freely available at https://github.com/kurtsemih/SNJ.
Supplementary information: Supplementary data are available at Bioinformatics online.
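To give a feel for the scaling claim in the Results above, the following minimal sketch compares the number of entries in a dense n x n distance matrix with the stated sparse bounds. It is an illustration only, not the Sparse Neighbor Joining implementation: the function names are hypothetical, and the base-2 logarithm and the constant factor of 1 are assumptions, since big-O notation fixes neither.

```python
import math


def dense_entries(n: int) -> int:
    """Number of entries in a full n x n distance matrix."""
    return n * n


def sparse_bound_basic(n: int) -> float:
    """Illustrative n * log2(n) count (assumed base 2, constant factor 1)."""
    return n * math.log2(n)


def sparse_bound_enhanced(n: int) -> float:
    """Illustrative n * log2(n)^2 count (assumed base 2, constant factor 1)."""
    return n * math.log2(n) ** 2


def compare(n: int) -> dict:
    """Return the three counts and the dense-to-sparse ratios for n leaves."""
    dense = dense_entries(n)
    basic = sparse_bound_basic(n)
    enhanced = sparse_bound_enhanced(n)
    return {
        "n": n,
        "dense": dense,
        "basic": basic,
        "enhanced": enhanced,
        "dense_over_basic": dense / basic,
        "dense_over_enhanced": dense / enhanced,
    }


if __name__ == "__main__":
    # Print the counts for a few tree sizes (powers of two keep log2 exact).
    for n in (1024, 16384, 1048576):
        print(compare(n))
```

Under these assumptions, the gap between the dense and sparse counts widens as n grows, which is consistent with the abstract's claim that the savings matter most for large datasets.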
Title: Sparse Neighbor Joining: rapid phylogenetic inference using a sparse distance matrix.
Pub Date : 2024-11-21 DOI: 10.1093/bioinformatics/btae701
Semih Kurt, Alexandre Bouchard-Côté, Jens Lagergren
Journal: Bioinformatics (Oxford, England)