Pub Date: 2025-02-28 DOI: 10.1093/bioinformatics/btaf093
Anish K Simhal, Corey Weistuch, Kevin Murgas, Daniel Grange, Jiening Zhu, Jung Hun Oh, Rena Elkin, Joseph O Deasy
Motivation: Although recent advanced sequencing technologies have improved the resolution of genomic and proteomic data to better characterize molecular phenotypes, efficient computational tools to analyze and interpret large-scale omic data are still needed.
Results: To address this, we have developed a network-based bioinformatic tool called Ollivier-Ricci curvature for omics (ORCO). ORCO combines omics data with a network describing biological relationships between genes or proteins and computes Ollivier-Ricci curvature (ORC) values for individual interactions. ORC is an edge-based measure that assesses network robustness. It captures functional cooperation in gene signaling using a consistent information-passing measure, which can help investigators identify therapeutic targets and key regulatory modules in biological systems. ORC has yielded novel insights in multiple cancer types using genomic data and in neurodevelopmental disorders using brain imaging data. This tool is applicable to any data that can be represented as a network.
Availability: ORCO is an open-source Python package and is publicly available on GitHub at https://github.com/aksimhal/ORC-Omics.
Supplementary information: Code and notebooks are available at github.com/aksimhal/ORC-Omics.
Title: ORCO: Ollivier-Ricci Curvature-Omics, an unsupervised method for analyzing robustness in biological systems. Journal: Bioinformatics (Oxford, England).
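To make the edge-based measure in the ORCO abstract concrete, the following is a minimal sketch of Ollivier-Ricci curvature on a tiny unweighted graph. It is not the ORCO implementation: it assumes unit edge lengths, places a uniform measure on each node's neighbors (ORCO instead derives weights from omics data), and finds the Wasserstein-1 distance by brute-force matching, which is only practical for toy graphs. The function names and example graphs are hypothetical.

```python
from collections import deque
from fractions import Fraction
from itertools import permutations
from math import gcd


def shortest_path_lengths(graph, source):
    """Hop distances from `source` via breadth-first search (unit edge lengths)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def ollivier_ricci(graph, x, y):
    """Curvature of edge (x, y): 1 - W1(mu_x, mu_y) / d(x, y), with d(x, y) = 1.

    Assumption: mu_x and mu_y are uniform over the neighbors of x and y,
    and the graph is connected.
    """
    assert y in graph[x], "x and y must be adjacent"
    nbrs_x, nbrs_y = sorted(graph[x]), sorted(graph[y])
    # Replicate neighbors so both measures have `units` points of equal mass.
    units = len(nbrs_x) * len(nbrs_y) // gcd(len(nbrs_x), len(nbrs_y))
    src = [n for n in nbrs_x for _ in range(units // len(nbrs_x))]
    dst = [n for n in nbrs_y for _ in range(units // len(nbrs_y))]
    dist = {n: shortest_path_lengths(graph, n) for n in set(src)}
    # With equal point masses, an optimal transport plan is a matching,
    # so trying every matching yields the exact Wasserstein-1 cost.
    best = min(
        sum(dist[s][t] for s, t in zip(src, perm)) for perm in permutations(dst)
    )
    return 1 - Fraction(best, units)


# A complete graph (highly redundant) and two hubs joined by a single bridge.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
bridge = {
    "x": {"y", "a", "b"}, "y": {"x", "c", "d"},
    "a": {"x"}, "b": {"x"}, "c": {"y"}, "d": {"y"},
}

print(ollivier_ricci(k4, 0, 1))          # 2/3
print(ollivier_ricci(bridge, "x", "y"))  # -2/3
```

The positive value on the complete graph and the negative value on the bridge illustrate the usual reading of ORC: edges inside densely connected regions have positive curvature, while bottleneck edges have negative curvature.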
Pub Date: 2025-02-27 DOI: 10.1093/bioinformatics/btaf092
Joel Selvaraj, Liguo Wang, Jianlin Cheng
Motivation: Cryogenic Electron Microscopy (cryo-EM) is a core experimental technique used to determine the structure of macromolecules such as proteins. However, the effectiveness of cryo-EM is often hindered by the noise and missing density values in cryo-EM density maps caused by experimental conditions such as low contrast and conformational heterogeneity. Although various global and local map sharpening techniques are widely employed to improve cryo-EM density maps, it is still challenging to efficiently improve their quality for building better protein structures from them.
Results: In this study, we introduce CryoTEN, a three-dimensional UNETR++-style transformer to improve cryo-EM maps effectively. CryoTEN is trained using a diverse set of 1,295 cryo-EM maps as inputs and their corresponding simulated maps generated from known protein structures as targets. An independent test set containing 150 maps is used to evaluate CryoTEN, and the results demonstrate that it can robustly enhance the quality of cryo-EM density maps. In addition, automatic de novo protein structure modeling shows that protein structures built from the density maps processed by CryoTEN have substantially better quality than those built from the original maps. Compared to the existing state-of-the-art deep learning methods for enhancing cryo-EM density maps, CryoTEN ranks second in improving the quality of density maps, while running more than 10 times faster and requiring much less GPU memory.
Availability and implementation: The source code and data are freely available at https://github.com/jianlin-cheng/cryoten.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: CryoTEN: Efficiently Enhancing cryo-EM Density Maps Using Transformers.
Pub Date: 2025-02-27 DOI: 10.1093/bioinformatics/btaf026
Rui Liu, Zhengwu Zhang, Hyejung Won, J S Marron
Motivation: Hi-C technology has been developed to profile genome-wide chromosome conformation. So far Hi-C data has been generated from a large compendium of different cell types and different tissue types. Among different chromatin conformation units, chromatin loops were found to play a key role in gene regulation across different cell types. While many different loop calling algorithms have been developed, most loop callers identified shared loops as opposed to cell type specific loops.
Results: We propose SSSHiC, a new loop calling algorithm based on significance in scale space, which can be used to understand data at different levels of resolution. By applying SSSHiC to neuronal and glial Hi-C data, we detected more loops that are potentially engaged in cell type specific gene regulation. Compared with other loop callers, such as Mustache, these loops were more frequently anchored to gene promoters of cellular marker genes and had better APA scores. Therefore, our results suggest that SSSHiC can effectively capture loops that contain more gene regulatory information.
Availability and implementation: The Hi-C data used in this study can be accessed through the PsychENCODE Knowledge Portal at https://www.synapse.org/#!Synapse:syn21760712. The code utilized for Curvature SSS cited in this study is available at https://github.com/jsmarron/MarronMatlabSoftware/blob/master/Matlab9/Matlab9Combined.zip. All custom code used in this research can be found in the GitHub repository: https://github.com/jerryliu01998/HiC. The code has also been submitted to Code Ocean with the DOI: 10.24433/CO.1912913.v1.
Contact: For inquiries or support, please contact Rui Liu (jerryliu@unc.edu). Collaboration and feedback from the community are welcome.
Supplementary information: Supplementary data supporting this study, including example datasets and detailed evaluations, are available at https://github.com/jerryliu01998/HiC/blob/main/SM_HiC_202408.pdf. Additional results are included in the Supplementary Materials section.
Title: Significance in Scale Space for Hi-C Data.
Pub Date: 2025-02-25 DOI: 10.1093/bioinformatics/btaf084
Antonino Zito, Axel Martinelli, Mauro Masiero, Murat Akhmedov, Ivo Kwee
Motivation: Batch effects (BEs) are a predominant source of noise in omics data and often mask real biological signals. BEs remain common in existing datasets. Current methods for BE correction mostly rely on specific assumptions or complex models, and may not detect and adjust BEs adequately, impacting downstream analysis and discovery power. To address these challenges we developed NPM, a nearest-neighbor matching-based method that adjusts BEs and may outperform other methods in a wide range of datasets.
Results: We assessed distinct metrics and graphical readouts, and compared our method to commonly used BE correction methods. NPM demonstrates the ability to correct BEs while preserving biological differences. It may outperform other methods based on multiple metrics. Altogether, NPM proves to be a valuable BE correction approach to maximize discovery in biomedical research, with applicability in clinical research where latent BEs are often dominant.
Availability: NPM is freely available on GitHub (https://github.com/bigomics/NPM) and on Omics Playground (https://bigomics.ch/omics-playground). Code for the analyses is available at https://github.com/bigomics/NPM. The datasets underlying this article are the following: GSE120099, GSE82177, GSE162760, GSE171343, GSE153380, GSE163214, GSE182440, GSE163857, GSE117970, GSE173078, GSE10846. All these datasets are publicly available and can be freely accessed on the Gene Expression Omnibus (GEO) repository.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: NPM: Latent Batch Effects Correction of Omics data by Nearest Pair Matching.
Pub Date: 2025-02-25 DOI: 10.1093/bioinformatics/btaf090
Wendao Liu, Zhongming Zhao
Motivation: Immune cells undergo cytokine-driven polarization in response to diverse stimuli, altering their transcriptional profiles and functional states. This dynamic process is central to immune responses in health and disease, yet a systematic approach to assess cytokine-driven polarization in single-cell RNA sequencing data has been lacking.
Results: To address this gap, we developed Single-cell unified polarization assessment (Scupa), the first computational method for comprehensive immune cell polarization assessment. Scupa leverages data from the Immune Dictionary, which characterizes cytokine-driven polarization states across 14 immune cell types. By integrating cell embeddings from the single-cell foundation model Universal Cell Embeddings, Scupa effectively identifies polarized cells across different species and experimental conditions. Applications of Scupa in independent datasets demonstrated its accuracy in classifying polarized cells and further revealed distinct polarization profiles in tumor-infiltrating myeloid cells across cancers. Scupa complements conventional single-cell data analysis by providing new insights into dynamic immune cell states, and holds potential for advancing therapeutic insights, particularly in cytokine-based therapies.
Availability and implementation: The code is available at https://github.com/bsml320/Scupa.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Scupa: Single-cell unified polarization assessment of immune cells using the single-cell foundation model.
Pub Date: 2025-02-25 DOI: 10.1093/bioinformatics/btaf091
Kari Lavikka, Altti Ilari Maarala, Jaana Oikkonen, Sampsa Hautaniemi
Summary: Spatial and temporal intra-tumor heterogeneity drives tumor evolution and therapy resistance. Existing visualization tools often fail to capture both dimensions simultaneously. To address this, we developed Jellyfish, a tool that integrates phylogenetic and sample trees into a single plot, providing a holistic view of tumor evolution and capturing both spatial and temporal evolution. Available as a JavaScript library and R package, Jellyfish generates interactive visualizations from tumor phylogeny and clonal composition data. We demonstrate its ability to visualize complex subclonal dynamics using data from ovarian high-grade serous carcinoma.
Availability and implementation: Jellyfish is freely available under the MIT license at https://github.com/HautaniemiLab/jellyfish (JavaScript library) and https://github.com/HautaniemiLab/jellyfisher (R package).
Title: Jellyfish: integrative visualization of spatio-temporal tumor evolution and clonal dynamics.
Motivation: Identifying bacteriophages (phages) within metagenomic sequences is essential for understanding microbial community dynamics. Transformer-based foundation models have been successfully employed to address various biological challenges. However, these models are typically pre-trained with self-supervised tasks that do not consider label variance in the pre-training data. This presents a challenge for phage identification as pre-training on mixed bacterial and phage data may lead to information bias due to the imbalance between bacterial and phage samples.
Results: To overcome this limitation, we proposed a novel conditional BERT framework that incorporates label classes as special tokens during pre-training. Specifically, our conditional BERT model attaches labels directly during tokenization, introducing label constraints into the model's input. Additionally, we introduced a new fine-tuning scheme that enables the conditional BERT to be effectively utilized for classification tasks. This framework allows the BERT model to acquire label-specific contextual representations from mixed sequence data during pre-training and applies the conditional BERT as a classifier during fine-tuning. We named the fine-tuned model PharaCon. We evaluated PharaCon against several existing methods on both simulated sequence datasets and real metagenomic contig datasets. The results demonstrate PharaCon's effectiveness and efficiency in phage identification, highlighting the advantages of incorporating label information during both pre-training and fine-tuning.
Availability: The code of PharaCon is available at https://github.com/Celestial-Bai/PharaCon.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: PharaCon: A new framework for identifying bacteriophages via conditional representation learning. Authors: Zeheng Bai, Yao-Zhong Zhang, Yuxuan Pang, Seiya Imoto. Pub Date: 2025-02-24. DOI: 10.1093/bioinformatics/btaf085.
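The idea of attaching the class label as a special token during tokenization, as described in the PharaCon abstract, can be illustrated with a short sketch. This is not the PharaCon tokenizer: the overlapping k-mer scheme, the value k=3, and the token names `[PHAGE]` and `[BACTERIA]` are assumptions chosen only to show where a label token would sit in the input.

```python
# Hypothetical label tokens; the actual vocabulary used by PharaCon may differ.
LABEL_TOKENS = {"phage": "[PHAGE]", "bacteria": "[BACTERIA]"}


def kmer_tokens(sequence, k=3):
    """Split a DNA sequence into overlapping k-mers (an assumed tokenization)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def conditional_tokens(sequence, label, k=3):
    """Build a BERT-style input with the label token placed right after [CLS]."""
    return ["[CLS]", LABEL_TOKENS[label]] + kmer_tokens(sequence, k) + ["[SEP]"]


print(conditional_tokens("ACGTA", "phage"))
# ['[CLS]', '[PHAGE]', 'ACG', 'CGT', 'GTA', '[SEP]']
```

Because the label travels with the input, the same sequence can be encoded under either label, which is what lets a model learn label-specific contextual representations from mixed bacterial and phage data.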
Pub Date: 2025-02-24 DOI: 10.1093/bioinformatics/btaf083
Xindian Wei, Tianyi Chen, Xibiao Wang, Wenjun Shen, Cheng Liu, Si Wu, Hau-San Wong
Motivation: Single-cell RNA sequencing (scRNA-seq) enables high-throughput transcriptomic profiling at single-cell resolution. The inherent spatial location is crucial for understanding how single cells orchestrate multicellular functions and drive diseases. However, spatial information is often lost during tissue dissociation. Spatial transcriptomic (ST) technologies can provide a precise spatial gene expression atlas, but their practicality is constrained by the number of genes they can assay or the associated costs at a larger scale, and by fine-grained cell type annotation. By transferring knowledge between scRNA-seq and spatial transcriptomics data through cell correspondence learning, it is possible to recover the spatial properties inherent in scRNA-seq datasets.
Results: In this study, we introduce COME, a COntrastive Mapping lEarning approach that learns a mapping between ST and scRNA-seq data to recover the spatial information of scRNA-seq data. Extensive experiments demonstrate that the proposed COME method effectively captures precise cell-spot relationships and outperforms previous methods in recovering spatial location for scRNA-seq data. More importantly, our method is capable of precisely identifying biologically meaningful information within the data, such as the spatial structure of missing genes, spatial hierarchical patterns, and the cell-type compositions for each spot. These results indicate that the proposed COME method can help to understand the heterogeneity and activities among cells within tissue environments.
Availability and implementation: COME is freely available on GitHub (https://github.com/cindyway/COME).
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: COME: contrastive mapping learning for spatial reconstruction of scRNA-seq data.
Pub Date: 2025-02-22 DOI: 10.1093/bioinformatics/btaf087
Ali Hamraoui, Laurent Jourdren, Morgane Thomas-Chollier
Motivation: Combining long-read sequencing technologies such as Oxford Nanopore with single-cell RNA sequencing (scRNAseq) assays enables detailed exploration of transcriptomic complexity, including isoform detection and quantification, by capturing full-length cDNAs. However, challenges remain, including the lack of advanced simulation tools that can effectively mimic the unique complexities of scRNAseq long-read datasets. Such tools are essential for evaluating and optimizing isoform detection methods dedicated to single-cell long-read studies.
Results: We developed AsaruSim, a workflow that simulates synthetic single-cell long-read Nanopore datasets, closely mimicking real experimental data. AsaruSim employs a multi-step process that includes the creation of a synthetic UMI count matrix, generation of perfect reads, optional PCR amplification, introduction of sequencing errors, and comprehensive quality control reporting. Applied to a dataset of human peripheral blood mononuclear cells (PBMCs), AsaruSim accurately reproduced experimental read characteristics.
Availability: The source code and full documentation are available at: https://github.com/GenomiqueENS/AsaruSim.
Supplementary information: Supplementary data are available at Bioinformatics online.
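The multi-step process described in the AsaruSim Results can be illustrated with a small toy sketch. This is not AsaruSim's code or API: every function name, parameter, and default below is hypothetical, the count matrix is drawn at random instead of being fitted to real data, and the error model uses substitutions only, whereas real Nanopore reads also contain insertions and deletions. It only shows how the stages (count matrix, perfect reads, optional PCR, errors, QC summary) might chain together.

```python
import random

BASES = "ACGT"


def random_seq(rng, length):
    """Return a random DNA string of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))


def make_count_matrix(rng, cells, transcripts, max_count=3):
    """Step 1 (toy): UMI counts stored as {(cell_barcode, transcript): count}."""
    return {(c, t): rng.randint(0, max_count) for c in cells for t in transcripts}


def make_perfect_reads(rng, counts, sequences, umi_len=8):
    """Step 2: one error-free read per UMI, laid out as barcode + UMI + cDNA."""
    reads = []
    for (cell, transcript), n in counts.items():
        for _ in range(n):
            reads.append(cell + random_seq(rng, umi_len) + sequences[transcript])
    return reads


def pcr_amplify(rng, reads, cycles=2, efficiency=0.5):
    """Step 3 (optional): each cycle copies every read with probability `efficiency`."""
    pool = list(reads)
    for _ in range(cycles):
        pool += [r for r in pool if rng.random() < efficiency]
    return pool


def add_errors(rng, read, error_rate=0.05):
    """Step 4 (assumed model): independent substitutions, no indels."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def qc_report(perfect, noisy):
    """Step 5: read count, mean read length, and observed substitution rate."""
    total = sum(len(p) for p in perfect)
    mismatches = sum(a != b for p, n in zip(perfect, noisy) for a, b in zip(p, n))
    return {
        "n_reads": len(noisy),
        "mean_length": total / len(perfect) if perfect else 0.0,
        "observed_error_rate": mismatches / total if total else 0.0,
    }


def simulate(seed=0, pcr_cycles=2, error_rate=0.05):
    """Run the five toy steps in order with a seeded random generator."""
    rng = random.Random(seed)
    cells = [random_seq(rng, 16) for _ in range(3)]             # 3 cell barcodes
    sequences = {f"tx{i}": random_seq(rng, 50) for i in range(2)}  # 2 transcripts
    counts = make_count_matrix(rng, cells, sequences)
    perfect = make_perfect_reads(rng, counts, sequences)
    amplified = pcr_amplify(rng, perfect, cycles=pcr_cycles)
    noisy = [add_errors(rng, r, error_rate) for r in amplified]
    return counts, amplified, noisy, qc_report(amplified, noisy)


if __name__ == "__main__":
    _, _, _, report = simulate(seed=0)
    print(report)
```

Setting `pcr_cycles=0` skips amplification, which mirrors the "optional" PCR step, and comparing the amplified reads with their noisy copies gives a simple check that the observed error rate tracks the requested one.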
{"title":"AsaruSim: a single-cell and spatial RNA-Seq Nanopore long-reads simulation workflow.","authors":"Ali Hamraoui, Laurent Jourdren, Morgane Thomas-Chollier","doi":"10.1093/bioinformatics/btaf087","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf087","journal":{"name":"Bioinformatics (Oxford, England)"},"publicationDate":"2025-02-22"}
Pub Date: 2025-02-22 DOI: 10.1093/bioinformatics/btaf077
Vikash Pandey
Motivation: Modeling genome-scale metabolic networks (GEMs) helps understand metabolic fluxes in cells at a specific state under defined environmental conditions or perturbations. Elementary Flux Modes (EFMs) are powerful tools for simplifying complex metabolic networks into smaller, more manageable pathways. However, the enumeration of all EFMs, especially within GEMs, poses significant challenges due to computational complexity. Additionally, traditional EFM approaches often fail to capture essential aspects of metabolism, such as co-factor balancing and by-product generation. The previously developed Minimum Network Enrichment Analysis (MiNEA) method addresses these limitations by enumerating alternative minimal networks for given biomass building blocks and metabolic tasks. MiNEA facilitates a deeper understanding of metabolic task flexibility and context-specific metabolic routes by integrating condition-specific transcriptomics, proteomics, and metabolomics data. This approach offers significant improvements in the analysis of metabolic pathways, providing more comprehensive insights into cellular metabolism.
Results: Here, I present MiNEApy, a Python package reimplementation of MiNEA, which computes minimal networks and performs enrichment analysis. I demonstrate the application of MiNEApy on both a small-scale and a genome-scale model of the bacterium E. coli, showcasing its ability to conduct minimal network enrichment analysis using minimal networks and context-specific data.
Availability: MiNEApy can be accessed at: https://github.com/vpandey-om/mineapy.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"MiNEApy: Enhancing Enrichment Network Analysis in Metabolic Networks.","authors":"Vikash Pandey","doi":"10.1093/bioinformatics/btaf077","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf077","journal":{"name":"Bioinformatics (Oxford, England)"},"publicationDate":"2025-02-22"}