Motivation: Accurate quantification of genotype uncertainty is pivotal in ensuring the reliability of genetic inferences drawn from NGS data. Genotype uncertainty is typically modeled using Genotype Likelihoods (GLs), which can help propagate measures of statistical uncertainty in base calls to downstream analyses. However, the effects of errors and biases in the estimation of GLs, introduced by biases in the original base call quality scores or the discretization of quality scores, as well as the choice of the GL model, remain under-explored.
Results: We present vcfgl, a versatile tool for simulating genotype likelihoods associated with simulated read data. It offers a framework for researchers to simulate and investigate the uncertainties and biases associated with the quantification of uncertainty, thereby facilitating a deeper understanding of their impacts on downstream analytical methods. Through simulations, we demonstrate the utility of vcfgl in benchmarking GL-based methods. The program can calculate GLs using various widely used genotype likelihood models and can simulate the errors in quality scores using a Beta distribution. It is compatible with modern simulators such as msprime and SLiM, and can output data in pileup, VCF/BCF and gVCF file formats, supporting a wide range of applications. The vcfgl program is freely available as efficient and user-friendly software written in C/C++.
Availability: vcfgl is freely available at https://github.com/isinaltinkaya/vcfgl.
Supplementary information: Supplementary data are available at Bioinformatics online.
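To make the idea of a genotype likelihood concrete, the following is a minimal Python sketch of one widely used per-read model, in which a base with Phred quality Q has error probability 10^(-Q/10) and a diploid genotype emits each of its two alleles with probability 1/2. It is not vcfgl's code: the function names are hypothetical, and the mean/variance parameterization of the Beta distribution is an assumption made for illustration only.

```python
import math
import random
from itertools import combinations_with_replacement

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_to_phred(e):
    """Discretize an error probability to an integer Phred score."""
    return round(-10 * math.log10(e))


def genotype_log10_likelihoods(bases, quals):
    """log10 P(reads | genotype) for the 10 unordered diploid genotypes."""
    gls = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):
        total = 0.0
        for b, q in zip(bases, quals):
            e = phred_to_error(q)
            # Probability of observing base b if the read came from allele a.
            p1 = 1 - e if b == a1 else e / 3
            p2 = 1 - e if b == a2 else e / 3
            total += math.log10(0.5 * p1 + 0.5 * p2)
        gls[a1 + a2] = total
    return gls


def beta_error_rate(mean, variance, rng):
    """Draw an error probability from a Beta distribution (assumed
    mean/variance parameterization; needs variance < mean * (1 - mean))."""
    common = mean * (1 - mean) / variance - 1
    return rng.betavariate(mean * common, (1 - mean) * common)


if __name__ == "__main__":
    rng = random.Random(42)
    # Per-read error rates vary around 1%, then get rounded to Phred scores.
    quals = [error_to_phred(beta_error_rate(0.01, 1e-5, rng)) for _ in range(4)]
    for genotype, gl in genotype_log10_likelihoods("AAAC", quals).items():
        print(genotype, round(gl, 3))
```

Rounding the sampled error probability to an integer Phred score is one simple way to see how discretization of quality scores can shift the resulting likelihoods.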
Vcfgl: A flexible genotype likelihood simulator for VCF/BCF files. Isin Altinkaya, Rasmus Nielsen, Thorfinn Sand Korneliussen. Bioinformatics (Oxford, England), 2025-03-05. DOI: 10.1093/bioinformatics/btaf098.
Pub Date: 2025-03-04; DOI: 10.1093/bioinformatics/btaf090
Wendao Liu, Zhongming Zhao
Motivation: Immune cells undergo cytokine-driven polarization in response to diverse stimuli, altering their transcriptional profiles and functional states. This dynamic process is central to immune responses in health and diseases, yet a systematic approach to assess cytokine-driven polarization in single-cell RNA sequencing data has been lacking.
Results: To address this gap, we developed single-cell unified polarization assessment (Scupa), the first computational method for comprehensive immune cell polarization assessment. Scupa leverages data from the Immune Dictionary, which characterizes cytokine-driven polarization states across 14 immune cell types. By integrating cell embeddings from the single-cell foundation model Universal Cell Embeddings, Scupa effectively identifies polarized cells across different species and experimental conditions. Applications of Scupa in independent datasets demonstrated its accuracy in classifying polarized cells and further revealed distinct polarization profiles in tumor-infiltrating myeloid cells across cancers. Scupa complements conventional single-cell data analysis by providing new insights into dynamic immune cell states, and holds potential for advancing therapeutic insights, particularly in cytokine-based therapies.
Availability and implementation: The code is available at https://github.com/bsml320/Scupa.
Scupa: single-cell unified polarization assessment of immune cells using the single-cell foundation model. Wendao Liu, Zhongming Zhao. Bioinformatics (Oxford, England), 2025-03-04. DOI: 10.1093/bioinformatics/btaf090. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11893155/pdf/
Pub Date: 2025-03-04; DOI: 10.1093/bioinformatics/btaf067
Manuel A Rivas, Christopher Chang
Motivation: The growing availability of large-scale population biobanks has the potential to significantly advance our understanding of human health and disease. However, the massive computational and storage demands of whole genome sequencing (WGS) data pose serious challenges, particularly for underfunded institutions or researchers in developing countries. This disparity in resources can limit equitable access to cutting-edge genetic research.
Results: We present novel algorithms and regression methods that dramatically reduce both computation time and storage requirements for WGS studies, with particular attention to rare variant representation. By integrating these approaches into PLINK 2.0, we demonstrate substantial gains in efficiency without compromising analytical accuracy. In an exome-wide association analysis of 19.4 million variants for the body mass index phenotype in 125 077 individuals (All of Us project data), we reduced runtime from 695.35 min (11.5 h) on a single machine to 1.57 min with 30 GB of memory and 50 threads (or 8.67 min with 4 threads). Additionally, the framework supports multi-phenotype analyses, further enhancing its flexibility.
Availability and implementation: Our optimized methods are fully integrated into PLINK 2.0 and can be accessed at: https://www.cog-genomics.org/plink/2.0/.
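The runtimes quoted in the Results can be turned into speedup factors with a few lines of Python. This is only arithmetic on the figures stated in the abstract; the variable names are illustrative.

```python
# Runtimes in minutes, as reported in the abstract.
baseline_min = 695.35
optimized_min = {"50 threads": 1.57, "4 threads": 8.67}

# Speedup = baseline runtime / optimized runtime.
speedups = {label: baseline_min / t for label, t in optimized_min.items()}

for label, factor in speedups.items():
    print(f"{label}: about {factor:.0f}x faster than the baseline")
```

With these figures the speedup is roughly 443-fold with 50 threads and roughly 80-fold with 4 threads.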
Efficient storage and regression computation for population-scale genome sequencing studies. Manuel A Rivas, Christopher Chang. Bioinformatics (Oxford, England), 2025-03-04. DOI: 10.1093/bioinformatics/btaf067. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11893150/pdf/
Pub Date: 2025-03-04; DOI: 10.1093/bioinformatics/btaf073
Ross F Laidlaw, Emma M Briggs, Keith R Matthews, Amir Madany Mamlouk, Richard McCulloch, Thomas D Otto
Motivation: Single-cell transcriptomics sequencing is used to compare different biological processes. However, those processes are often asymmetric, which makes them difficult to integrate. Current approaches often rely on integrating samples from each condition before either cluster-based comparisons or analysis of an inferred shared trajectory.
Results: We present Trajectory Alignment of Gene Expression Dynamics (TrAGEDy), which allows the alignment of independent trajectories to avoid the need for error-prone integration steps. Across simulated datasets, TrAGEDy returns the correct underlying alignment of the datasets, outperforming current tools which fail to capture the complexity of asymmetric alignments. When applied to real datasets on the development of T cells and of the bloodstream forms of Trypanosoma brucei affected by genetic knockouts, TrAGEDy captures more biologically relevant genes and processes, which other differential expression methods fail to detect.
Availability and implementation: TrAGEDy is freely available at https://github.com/No2Ross/TrAGEDy, and implemented in R.
TrAGEDy-trajectory alignment of gene expression dynamics. Ross F Laidlaw, Emma M Briggs, Keith R Matthews, Amir Madany Mamlouk, Richard McCulloch, Thomas D Otto. Bioinformatics (Oxford, England), 2025-03-04. DOI: 10.1093/bioinformatics/btaf073. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11908647/pdf/
Pub Date: 2025-03-04; DOI: 10.1093/bioinformatics/btaf084
Antonino Zito, Axel Martinelli, Mauro Masiero, Murodzhon Akhmedov, Ivo Kwee
Motivation: Batch effects (BEs) are a predominant source of noise in omics data and often mask real biological signals. BEs remain common in existing datasets. Current methods for BE correction mostly rely on specific assumptions or complex models, and may not detect and adjust BEs adequately, impacting downstream analysis and discovery power. To address these challenges we developed NPM, a nearest-neighbor matching-based method that adjusts BEs and may outperform other methods in a wide range of datasets.
Results: We assessed distinct metrics and graphical readouts, and compared our method to commonly used BE correction methods. NPM corrects for BEs while preserving biological differences. It may outperform other methods based on multiple metrics. Altogether, NPM proves to be a valuable BE correction approach to maximize discovery in biomedical research, with applicability in clinical research where latent BEs are often dominant.
Availability and implementation: NPM is freely available on GitHub (https://github.com/bigomics/NPM) and on Omics Playground (https://bigomics.ch/omics-playground). Code for the analyses is available at https://github.com/bigomics/NPM. The datasets underlying this article are the following: GSE120099, GSE82177, GSE162760, GSE171343, GSE153380, GSE163214, GSE182440, GSE163857, GSE117970, GSE173078, and GSE10846. All these datasets are publicly available and can be freely accessed on the Gene Expression Omnibus repository.
NPM: latent batch effects correction of omics data by nearest-pair matching. Antonino Zito, Axel Martinelli, Mauro Masiero, Murodzhon Akhmedov, Ivo Kwee. Bioinformatics (Oxford, England), 2025-03-04. DOI: 10.1093/bioinformatics/btaf084. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11925496/pdf/
Pub Date: 2025-03-04; DOI: 10.1093/bioinformatics/btaf068
Yanxin Jiao, Hongjia Li, Yang Xue, Guoliang Yang, Lei Qi, Fa Zhang, Dawei Zang, Renmin Han
Motivation: Cryo-electron tomography (cryo-ET) has revolutionized our ability to observe structures from the subcellular to the atomic level in their native states. Achieving high-resolution reconstruction involves collecting tilt series at different angles and subsequently backprojecting them into 3D space or iteratively reconstructing them to build a 3D volume of the specimen. However, the intricate computational demands of tomographic reconstruction pose significant challenges, requiring extensive calculation times that hinder efficiency, especially with large and complex datasets.
Results: We present TiltRec, an open-source toolkit that leverages the parallel capabilities of Central Processing Units and Graphics Processing Units to enhance tomographic reconstruction. TiltRec implements six classical tomographic reconstruction algorithms, utilizing optimized parallel computation strategies and advanced memory management techniques. Performance evaluations across multiple datasets of varying sizes demonstrate that TiltRec significantly improves efficiency, reducing computational times while maintaining reconstruction resolution.
Summary: TiltRec effectively addresses the computational challenges associated with cryo-ET reconstruction by fully exploiting parallel acceleration. As an open-source tool, TiltRec not only facilitates extensive applications by the research community but also supports further algorithm modifications and extensions, enabling the continued development of novel algorithms.
Availability and implementation: The source code, documentation, and sample data can be downloaded at https://github.com/icthrm/TiltRec.
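The backprojection step mentioned in the Motivation can be illustrated with a deliberately simple Python sketch. It reconstructs a single 2D slice from 1D projections by unfiltered, nearest-neighbour backprojection. It is serial, it is not TiltRec's implementation, and the function name and toy data are hypothetical.

```python
import math


def back_project(projections, angles_deg, size):
    """Unfiltered back-projection of 1D projections into a size x size slice.

    projections[i] is the 1D projection recorded at tilt angle angles_deg[i].
    """
    centre = (size - 1) / 2
    recon = [[0.0] * size for _ in range(size)]
    for proj, angle in zip(projections, angles_deg):
        c = math.cos(math.radians(angle))
        s = math.sin(math.radians(angle))
        det_centre = (len(proj) - 1) / 2
        for iy in range(size):
            for ix in range(size):
                # Detector coordinate hit by the ray through pixel (ix, iy).
                t = (ix - centre) * c + (iy - centre) * s + det_centre
                k = int(round(t))  # nearest-neighbour interpolation
                if 0 <= k < len(proj):
                    recon[iy][ix] += proj[k]
    n = len(angles_deg)
    return [[v / n for v in row] for row in recon]


if __name__ == "__main__":
    size = 9
    angles = list(range(-60, 61, 20))  # a limited tilt range
    # Toy data: a point object at the centre projects to the central bin.
    point = [1.0 if i == size // 2 else 0.0 for i in range(size)]
    slice_2d = back_project([point] * len(angles), angles, size)
    for row in slice_2d:
        print(" ".join(f"{v:.2f}" for v in row))
```

Each output pixel is computed independently of the others, which is the property that makes this kind of reconstruction well suited to CPU and GPU parallelisation.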
TiltRec: an ultra-fast and open-source toolkit for cryo-electron tomographic reconstruction. Yanxin Jiao, Hongjia Li, Yang Xue, Guoliang Yang, Lei Qi, Fa Zhang, Dawei Zang, Renmin Han. Bioinformatics (Oxford, England), 2025-03-04. DOI: 10.1093/bioinformatics/btaf068. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11886794/pdf/
Pub Date: 2025-03-04; DOI: 10.1093/bioinformatics/btaf063
George I Gavriilidis, Vasileios Vasileiou, Stella Dimitsaki, Georgios Karakatsoulis, Antonis Giannakakis, Georgios A Pavlopoulos, Fotis Psomopoulos
Motivation: Computational analyses of bulk and single-cell omics provide translational insights into complex diseases, such as COVID-19, by revealing molecules, cellular phenotypes, and signalling patterns that contribute to unfavourable clinical outcomes. Current in silico approaches dovetail differential abundance, biostatistics, and machine learning, but often overlook nonlinear proteomic dynamics, like post-translational modifications, and provide limited biological interpretability beyond feature ranking.
Results: We introduce APNet, a novel computational pipeline that combines differential activity analysis based on SJARACNe co-expression networks with PASNet, a biologically informed sparse deep learning model, to perform explainable predictions for COVID-19 severity. The APNet driver-pathway network ingests SJARACNe co-regulation and classification weights to aid result interpretation and hypothesis generation. APNet outperforms alternative models in patient classification across three COVID-19 proteomic datasets, identifying predictive drivers and pathways, including some confirmed in single-cell omics and highlighting under-explored biomarker circuitries in COVID-19.
Availability and implementation: APNet's R and Python scripts and Cytoscape methodologies are available at https://github.com/BiodataAnalysisGroup/APNet.
APNet, an explainable sparse deep learning model to discover differentially active drivers of severe COVID-19. George I Gavriilidis, Vasileios Vasileiou, Stella Dimitsaki, Georgios Karakatsoulis, Antonis Giannakakis, Georgios A Pavlopoulos, Fotis Psomopoulos. Bioinformatics (Oxford, England), 2025-03-04. DOI: 10.1093/bioinformatics/btaf063. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11897427/pdf/
Pub Date: 2025-03-04; DOI: 10.1093/bioinformatics/btaf075
Wei Zou, Yongxin Ji, Jiaojiao Guan, Yanni Sun
Motivation: Plasmids play an essential role in horizontal gene transfer, aiding their host bacteria in acquiring beneficial traits like antibiotic and metal resistance. There exist some plasmids that can transfer, replicate, or persist in multiple organisms. Identifying the relatively complete host range of these plasmids provides insights into how plasmids promote bacterial evolution. To achieve this, we can apply multi-label learning models for plasmid host range prediction. However, there are no databases providing the detailed and complete host labels of these broad-host-range plasmids. Without adequate well-annotated training samples, learning models can fail to extract discriminative feature representations for plasmid host prediction.
Results: To address this problem, we propose a self-correction multi-label learning model called MOSTPLAS. We design a pseudo-label learning algorithm and a self-correction asymmetric loss to facilitate the training of a multi-label learning model with samples containing some unknown missing labels. We conducted a series of experiments on the NCBI RefSeq plasmid database, the PLSDB 2025 database, plasmids with experimentally determined host labels, the Hi-C dataset, and the DoriC dataset. The benchmark results against other plasmid host range prediction tools demonstrated that MOSTPLAS recognized more host labels while maintaining high precision.
Availability and implementation: MOSTPLAS is implemented in Python and can be downloaded at https://github.com/wzou96/MOSTPLAS. All relevant data we used in the experiments can be found at https://zenodo.org/doi/10.5281/zenodo.14708999.
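For readers unfamiliar with asymmetric losses in multi-label learning, here is a minimal Python sketch of the generic asymmetric loss as it is commonly defined: positives and negatives get separate focusing exponents, and negative probabilities are shifted down by a margin. It does not reproduce the self-correction mechanism or the pseudo-label algorithm of MOSTPLAS, which the abstract does not specify; the function name and default hyperparameters are illustrative assumptions.

```python
import math


def asymmetric_loss(probs, labels, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Generic asymmetric loss summed over the labels of one sample.

    probs  : predicted probability for each label (values in [0, 1])
    labels : 1 if the label is annotated as present, else 0
    """
    total = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            # Positive term: focal-style weighting with exponent gamma_pos.
            total += -((1 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative term: shift the probability down by the margin so
            # that easy negatives (p <= margin) contribute exactly zero.
            p_m = max(p - margin, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1 - p_m, eps))
    return total


if __name__ == "__main__":
    # Three host labels: one annotated positive, two annotated negatives.
    print(asymmetric_loss([0.8, 0.03, 0.6], [1, 0, 0]))
```

Treating the two label classes differently matters when annotated negatives may include unknown true hosts, because a symmetric loss would penalize every unannotated label equally.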
MOSTPLAS: a self-correction multi-label learning model for plasmid host range prediction. Wei Zou, Yongxin Ji, Jiaojiao Guan, Yanni Sun. Bioinformatics (Oxford, England), 2025-03-04. DOI: 10.1093/bioinformatics/btaf075. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11897426/pdf/
Pub Date: 2025-03-04. DOI: 10.1093/bioinformatics/btaf092
Joel Selvaraj, Liguo Wang, Jianlin Cheng
Motivation: Cryogenic electron microscopy (cryo-EM) is a core experimental technique used to determine the structure of macromolecules such as proteins. However, the effectiveness of cryo-EM is often hindered by the noise and missing density values in cryo-EM density maps caused by experimental conditions such as low contrast and conformational heterogeneity. Although various global and local map-sharpening techniques are widely employed to improve cryo-EM density maps, it is still challenging to efficiently improve their quality for building better protein structures from them.
Results: In this study, we introduce CryoTEN, a 3D UNETR++-style transformer that improves cryo-EM maps effectively. CryoTEN is trained using a diverse set of 1295 cryo-EM maps as inputs and their corresponding simulated maps, generated from known protein structures, as targets. An independent test set containing 150 maps is used to evaluate CryoTEN, and the results demonstrate that it can robustly enhance the quality of cryo-EM density maps. In addition, automatic de novo protein structure modeling shows that protein structures built from the density maps processed by CryoTEN have substantially better quality than those built from the original maps. Compared to the existing state-of-the-art deep learning methods for enhancing cryo-EM density maps, CryoTEN ranks second in improving the quality of density maps, while running >10 times faster and requiring much less GPU memory than those methods.
Availability and implementation: The source code and data are freely available at https://github.com/jianlin-cheng/cryoten.
{"title":"CryoTEN: efficiently enhancing cryo-EM density maps using transformers.","authors":"Joel Selvaraj, Liguo Wang, Jianlin Cheng","doi":"10.1093/bioinformatics/btaf092","DOIUrl":"10.1093/bioinformatics/btaf092","url":null,"abstract":"<p><strong>Motivation: </strong>Cryogenic electron microscopy (cryo-EM) is a core experimental technique used to determine the structure of macromolecules such as proteins. However, the effectiveness of cryo-EM is often hindered by the noise and missing density values in cryo-EM density maps caused by experimental conditions such as low contrast and conformational heterogeneity. Although various global and local map-sharpening techniques are widely employed to improve cryo-EM density maps, it is still challenging to efficiently improve their quality for building better protein structures from them.</p><p><strong>Results: </strong>In this study, we introduce CryoTEN-a 3D UNETR++ style transformer to improve cryo-EM maps effectively. CryoTEN is trained using a diverse set of 1295 cryo-EM maps as inputs and their corresponding simulated maps generated from known protein structures as targets. An independent test set containing 150 maps is used to evaluate CryoTEN, and the results demonstrate that it can robustly enhance the quality of cryo-EM density maps. In addition, automatic de novo protein structure modeling shows that protein structures built from the density maps processed by CryoTEN have substantially better quality than those built from the original maps. 
Compared to the existing state-of-the-art deep learning methods for enhancing cryo-EM density maps, CryoTEN ranks second in improving the quality of density maps, while running >10 times faster and requiring much less GPU memory than them.</p><p><strong>Availability and implementation: </strong>The source code and data are freely available at https://github.com/jianlin-cheng/cryoten.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11906401/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143560364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-04. DOI: 10.1093/bioinformatics/btaf107
Julia K Varga, Sergey Ovchinnikov, Ora Schueler-Furman
Summary: One of the main advantages of deep learning models of protein structure, such as AlphaFold2, is their ability to accurately estimate the confidence of a generated structural model, which allows us to focus on highly confident predictions. The ipTM score provides a confidence estimate of interchain contacts in protein-protein interactions. However, interactions, in particular motif-mediated interactions, often also contain regions that remain flexible upon binding. These noninteracting flanking regions are assigned low confidence values and affect ipTM, because it considers all interchain residue-residue pairs. As a result, two models of the same motif-domain interaction that differ only in the length of their flanking regions would be assigned very different values. Here, we propose actual interface pTM (actifpTM), a modified ipTM measure that focuses on the residues participating in the interaction, resulting in a more robust measure of interaction confidence. In addition, actifpTM is calculated both for the full complex and for each pair of chains, making it well-suited for evaluating multi-chain complexes with a particularly critical binding interface, such as antibody-antigen interactions.
Availability and implementation: The method is available as part of the ColabFold (https://github.com/sokrypton/ColabFold) repository, which can be installed locally or used through a Colab notebook.
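The effect described in the summary can be illustrated with a small sketch. This is not the ColabFold implementation: it assumes a matrix of expected aligned errors (in Å) rather than the full predicted error distribution, uses the standard TM-score distance scale, and restricts the score with a hard 0/1 pair mask. The function names (`tm_d0`, `interface_tm`, `toy_example`), the mask, and the toy error values are all hypothetical and chosen only for illustration.

```python
import numpy as np


def tm_d0(n_res):
    """TM-score distance scale; the floor of 19 residues is an assumption."""
    n = max(n_res, 19)
    return 1.24 * (n - 15) ** (1.0 / 3.0) - 1.8


def interface_tm(pae, chain_ids, pair_weights=None):
    """ipTM-like score from an expected aligned error matrix.

    pae[i, j]    : expected error of residue j when aligned on residue i.
    chain_ids    : chain label per residue.
    pair_weights : optional weights selecting which interchain pairs count
                   (hypothetical stand-in for an interface definition).
    """
    pae = np.asarray(pae, dtype=float)
    chain_ids = np.asarray(chain_ids)
    n = chain_ids.shape[0]

    # Only pairs of residues from different chains contribute.
    weights = (chain_ids[:, None] != chain_ids[None, :]).astype(float)
    if pair_weights is not None:
        weights = weights * np.asarray(pair_weights, dtype=float)

    tm_term = 1.0 / (1.0 + (pae / tm_d0(n)) ** 2)

    # Weighted mean of the TM term per aligned residue, then the best row.
    row_sum = weights.sum(axis=1)
    per_row = np.zeros(n)
    np.divide((weights * tm_term).sum(axis=1), row_sum,
              out=per_row, where=row_sum > 0)
    return float(per_row.max())


def toy_example():
    """Two chains, each with an ordered part and a flexible flank."""
    chain_ids = np.array([0] * 50 + [1] * 25)
    ordered = np.zeros(75, dtype=bool)
    ordered[:30] = True    # domain in chain 0 (residues 30-49 are flexible)
    ordered[50:55] = True  # motif in chain 1 (residues 55-74 are flanks)

    both_ordered = ordered[:, None] & ordered[None, :]
    pae = np.where(both_ordered, 2.0, 25.0)  # toy values in Angstrom

    all_pairs = interface_tm(pae, chain_ids)
    interface_only = interface_tm(pae, chain_ids, pair_weights=both_ordered)
    return all_pairs, interface_only


if __name__ == "__main__":
    all_pairs, interface_only = toy_example()
    print(f"all interchain pairs: {all_pairs:.3f}")
    print(f"interface pairs only: {interface_only:.3f}")
```

In this toy case the score over all interchain pairs is pulled down by the flexible regions, while the score restricted to the interacting residues is not, which is the behaviour the summary attributes to ipTM and actifpTM respectively.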
{"title":"actifpTM: a refined confidence metric of AlphaFold2 predictions involving flexible regions.","authors":"Julia K Varga, Sergey Ovchinnikov, Ora Schueler-Furman","doi":"10.1093/bioinformatics/btaf107","DOIUrl":"10.1093/bioinformatics/btaf107","url":null,"abstract":"<p><strong>Summary: </strong>One of the main advantages of deep learning models of protein structure, such as Alphafold2, is their ability to accurately estimate the confidence of a generated structural model, which allows us to focus on highly confident predictions. The ipTM score provides a confidence estimate of interchain contacts in protein-protein interactions. However, interactions, in particular motif-mediated interactions, often also contain regions that remain flexible upon binding. These noninteracting flanking regions are assigned low confidence values and will affect ipTM, as it considers all interchain residue-residue pairs, and two models of the same motif-domain interaction, but differing in the length of their flanking regions, would be assigned very different values. Here, we propose actual interface pTM (actifpTM), a modified ipTM measure, that focuses on the residues participating in the interaction, resulting in a more robust measure of interaction confidence. 
Besides, actifpTM is calculated both for the full complex as well as for each pair of chains, making it well-suited for evaluating multi-chain complexes with a particularly critical binding interface, such as antibody-antigen interactions.</p><p><strong>Availability and implementation: </strong>The method is available as part of the ColabFold (https://github.com/sokrypton/ColabFold) repository, installable both locally or usable with Colab notebook.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11925850/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143627063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}