Pub Date: 2025-11-28; DOI: 10.1186/s12859-025-06307-w
Ramu Gautam, Yang Jiao, Yasong Pang, Mo Weng, Mei Yang
Title: ProTrack3D: a comprehensive tool for segmentation and tracking of proteins with split and fusion.
Journal: BMC Bioinformatics, p. 4. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12777108/pdf/
Pub Date: 2025-11-28; DOI: 10.1186/s12859-025-06315-w
Espoir Kabanga, Seonil Jee, Soeun Yun, Stephen Depuydt, Arnout Van Messem, Wesley De Neve
Background: Splice site prediction in plant genomes poses substantial challenges that can be addressed using deep learning models. U2-type introns are especially useful for such studies given their ubiquity in plant genomes and the availability of rich datasets. We formulated two hypotheses: one proposing that short introns may enhance prediction effectiveness due to reduced spatial complexity, and another suggesting that sequences with multiple introns provide a richer context for splicing events.
Results: Our findings demonstrate that (1) models trained on datasets containing shorter introns achieve improved effectiveness for acceptor splice sites, but not for donor splice sites, indicating a more nuanced relationship between intron length and splice site prediction than initially hypothesized, and (2) models trained on datasets with multiple introns per sequence show higher effectiveness compared to those trained on datasets with a single intron per sequence. Notably, among the 402 bp sequences analyzed, 72% contained single introns while 28% contained multiple introns for donor sites (36,399 versus 13,987 sequences), with similar proportions observed for acceptor sites (37,236 versus 14,112 sequences). These computational insights align with biological observations, particularly regarding the conserved spatial relationship between branch points and acceptor splice sites, as well as the synergistic effects of multiple introns on splicing efficiency.
Conclusions: The obtained results contribute to a deeper understanding of how intronic features influence splice site prediction and suggest that future prediction models should consider factors such as intron length, multiplicity, and the spatial arrangement of splice-related signals.
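The 72% / 28% split quoted in the Results can be checked directly from the sequence counts given there. The short sketch below only redoes that arithmetic; the function name `intron_shares` is illustrative and not part of the authors' code.

```python
# Sequence counts quoted in the abstract: (single-intron, multiple-intron).
counts = {
    "donor": (36_399, 13_987),
    "acceptor": (37_236, 14_112),
}


def intron_shares(single: int, multiple: int) -> tuple[float, float]:
    """Return the (single, multiple) shares as percentages of the total."""
    total = single + multiple
    return 100 * single / total, 100 * multiple / total


for site, (single, multiple) in counts.items():
    s, m = intron_shares(single, multiple)
    print(f"{site}: {s:.1f}% single, {m:.1f}% multiple")
# donor: 72.2% single, 27.8% multiple
# acceptor: 72.5% single, 27.5% multiple
```

Both site types come out close to the rounded 72% / 28% figures, which is consistent with the "similar proportions" wording.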
Title: Impact of U2-type introns on splice site prediction in A. thaliana species using deep learning.
Journal: BMC Bioinformatics, vol. 26(1), p. 288. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12664245/pdf/
Pub Date: 2025-11-28; DOI: 10.1186/s12859-025-06312-z
Alexandre Chaussard, Anna Bonnet, Sylvain Le Corff, Harry Sokol
Background: The gut microbiome plays a crucial role in human health, making it a cornerstone of modern biomedical research. To study its structure and dynamics, machine learning models are increasingly used to identify key microbial patterns associated with disease and environmental factors, but their performance is often limited by the intrinsic complexity of microbiome data and the small size of available cohorts. In this context, data augmentation has emerged as a promising strategy to overcome these challenges by generating artificial microbiome profiles.
Results: We introduce TaxaPLN, a data augmentation method based on PLN-Tree generative models, which leverages the taxonomy and a data-driven sampler to generate realistic synthetic microbiome compositions. Additionally, we propose a conditional extension based on feature-wise linear modulation, enabling covariate-aware generation. Experiments on diverse curated microbiome datasets show that TaxaPLN preserves ecological properties and generally improves or maintains predictive performance, outperforming state-of-the-art baselines on most tasks. Furthermore, the conditional variant of TaxaPLN establishes a new benchmark for metadata-aware microbiome augmentation.
Conclusion: TaxaPLN provides a model-based framework for augmenting microbiome datasets while preserving their ecological and clinical relevance. By integrating taxonomic structure and host metadata, it enhances predictive modeling across diverse real-world settings. To facilitate reproducible and scalable microbiome analysis using our method, TaxaPLN is released as an open-source Python package available on PyPI (plntree), with MIT-licensed source code hosted at https://github.com/AlexandreChaussard/PLNTree-package.
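The conditional extension mentioned above relies on feature-wise linear modulation, in which each hidden feature is scaled and shifted by values computed from the covariates. The following is a minimal pure-Python sketch of that general idea only. It does not use the plntree package, and the function name `film`, the affine form of the scale and shift, and all weights are assumptions made for illustration.

```python
def film(features, covariates, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: out_i = gamma_i(c) * h_i + beta_i(c).

    Here gamma and beta are plain affine maps of the covariate vector c.
    That is an assumption for brevity; they are often small neural networks.
    """
    def affine(weights, bias):
        # One output per feature: dot(row, covariates) + bias term.
        return [sum(w * c for w, c in zip(row, covariates)) + b
                for row, b in zip(weights, bias)]

    gamma = affine(w_gamma, b_gamma)  # per-feature scale
    beta = affine(w_beta, b_beta)     # per-feature shift
    return [g * h + b for g, h, b in zip(gamma, features, beta)]


# Two hidden features modulated by a single (hypothetical) covariate.
out = film(
    features=[1.0, 2.0],
    covariates=[1.0],
    w_gamma=[[0.5], [0.0]], b_gamma=[1.0, 1.0],
    w_beta=[[0.0], [2.0]], b_beta=[0.0, 0.0],
)
print(out)  # [1.5, 4.0]
```

Because the covariates enter only through the scale and shift, the same generator can be reused across covariate settings.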
Title: TaxaPLN: a taxonomy-aware augmentation strategy for microbiome-trait classification including metadata.
Journal: BMC Bioinformatics, p. 1. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12763835/pdf/
Pub Date: 2025-11-27; DOI: 10.1186/s12859-025-06304-z
Jinxiong Zhang, Yunjv Zeng, Chunyan Tang, Cheng Zhong, Hao Wen, Yang Liu
Background: Polypharmacy's ability to circumvent acquired resistance to a single drug makes it a critical strategy for treating complex diseases. However, it inevitably carries risks of drug-drug interactions (DDIs) that may alter pharmacological activities and potentially lead to severe adverse events or mortality. Computational assessment of drug combinations has emerged as an effective approach to support clinical decision-making. Current risk identification methods focus on mining historical interaction patterns to uncover underlying mechanisms, yet face challenges from data sparsity. While data augmentation strategies can mitigate this problem, conventional approaches often introduce noise that obscures core pharmacological mechanisms, undermining safety evaluation.
Results: This study proposes MMDDI, a Multi-Mechanism Disentangled Drug-drug Interaction assessment framework integrated with contrastive learning, which includes two key components: (1) biologically informed multi-view generation that creates high-quality augmented views, effectively addressing semantic distortion during data augmentation; and (2) mechanism-aware disentanglement that incorporates mutual information constraints to isolate interaction mechanisms from the coupling of multi-modal and heterogeneous data, eliminating quantification bias. Contrastive learning integrates labeled and unlabeled data to enhance robustness against sparse observations.
Conclusions: Comprehensive evaluations demonstrate that MMDDI, with a hit@4 of 0.86, outperforms the compared baselines, with ablation studies validating the critical contributions of multi-view contrastive learning and mechanism disentanglement. MMDDI continues to demonstrate excellent performance in cold-start scenarios, achieving an accuracy of 0.94 and a recall of 0.95. Clinically, MMDDI enables interpretable causal analysis of drug interaction pathways through its mechanism-aware representations, providing practical support for optimizing therapeutic regimens.
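For readers unfamiliar with the metrics quoted above, the sketch below shows how hit@k, accuracy, and recall are commonly computed. It assumes hit@k means "the true label appears among the top k ranked predictions" and that accuracy and recall are the usual binary definitions; the abstract does not spell these out, and the toy labels are hypothetical.

```python
def hit_at_k(ranked_predictions, true_labels, k):
    """Fraction of samples whose true label is among the top-k ranked predictions."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_labels)
               if truth in ranked[:k])
    return hits / len(true_labels)


def accuracy_and_recall(predicted, actual, positive=1):
    """Binary accuracy and recall for the given positive class."""
    pairs = list(zip(predicted, actual))
    correct = sum(1 for p, a in pairs if p == a)
    true_pos = sum(1 for p, a in pairs if p == positive and a == positive)
    actual_pos = sum(1 for _, a in pairs if a == positive)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return correct / len(pairs), recall


# Toy example with hypothetical interaction-type labels.
ranked = [["a", "b", "c"], ["c", "a", "b"], ["b", "c", "a"]]
truth = ["a", "b", "a"]
print(hit_at_k(ranked, truth, k=2))  # 1 of 3 samples is a hit
print(accuracy_and_recall([1, 0, 1, 1], [1, 0, 0, 1]))  # (0.75, 1.0)
```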
Title: Contrastive learning-based multi-mechanism disentangled assessment for drug-drug interaction.
Journal: BMC Bioinformatics, vol. 26(1), p. 286. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12659097/pdf/
Pub Date: 2025-11-27; DOI: 10.1186/s12859-025-06311-0
Christopher T Boughter, Budha Chatterjee, Yuko Ohta, Katrina Gorga, Carly Blair, Elizabeth M Hill, Zachary Fasana, Adedola O Adebamowo, Farah Ammar, Ivan Kosik, Vel Murugan, Wilbur H Chen, Nevil J Singh, Martin Meier-Schellersheim
Background: Declining sequencing costs coupled with the increasing availability of easy-to-use kits for the isolation of DNA and RNA transcripts from single cells have driven a rapid proliferation of studies centered around genomic and transcriptomic data. Simultaneously, a wealth of new techniques have been developed that utilize single cell technologies to interrogate a broad range of cell-biological processes. One recently developed technique, transposase-accessible chromatin with sequencing (ATAC) with select antigen profiling by sequencing (ASAPseq), provides a combination of chromatin accessibility assessments with measurements of cell-surface marker expression levels. While software exists for the characterization of these datasets, there currently exists no tool explicitly designed to reformat ASAP surface marker FASTQ data into a count matrix which can then be used for these downstream analyses.
Results: To address this lack of a dedicated tool for ASAPseq data processing, we created CountASAP, an easy-to-use Python package purposefully designed to transform FASTQ files from ASAP experiments into count matrices compatible with commonly-used downstream bioinformatic analysis packages. CountASAP takes advantage of the independence of the relevant data structures to perform fully parallelized matches of each sequenced read to user-supplied input ASAP oligos and unique cell-identifier sequences. We directly compare the performance and user-friendliness of CountASAP to existing tools using similarly-structured data from a more common sequencing experiment: cellular indexing of transcriptomes and epitopes by sequencing (CITEseq). Further benchmarking against existing tools helps to identify proper defaults for CountASAP and assess the agreement of outputs from all tested software. A final test using a novel ASAPseq dataset provides evidence that CountASAP can generate biologically meaningful results that correlate well with paired chromatin accessibility data.
Conclusions: CountASAP shows good agreement with existing, well-tested data processing tools in the analysis of similarly-structured benchmarking data. CountASAP runs efficiently on a standard laptop, has user-friendly documentation and a one-step installation, and represents the first and only tool designed specifically for the processing of ASAPseq data.
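The core task described above, turning reads into a cell-by-marker count matrix by matching each read against known oligos and cell identifiers, can be sketched in a few lines. This is not CountASAP's API or algorithm. It assumes exact matching, a cell barcode at the start of one read and the oligo somewhere in its paired read, and the function names, barcodes, and marker names are all hypothetical.

```python
from collections import Counter


def read_sequences(fastq_lines):
    """Yield the sequence line (2nd of every 4 lines) from FASTQ-formatted text."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            yield line.strip()


def count_matrix(barcode_reads, oligo_reads, cell_barcodes, oligos):
    """Count (cell, marker) pairs for read pairs where both parts match exactly."""
    counts = Counter()
    for bc_seq, oligo_seq in zip(barcode_reads, oligo_reads):
        # Assumption: the cell barcode sits at the start of the barcode read.
        cell = next((bc for bc in cell_barcodes if bc_seq.startswith(bc)), None)
        # Assumption: the marker oligo appears verbatim anywhere in the paired read.
        marker = next((name for name, seq in oligos.items() if seq in oligo_seq), None)
        if cell is not None and marker is not None:
            counts[(cell, marker)] += 1
    return counts


r1 = ["@r1", "AAAACC", "+", "IIIIII", "@r2", "GGGGTT", "+", "IIIIII",
      "@r3", "AAAATT", "+", "IIIIII"]
r2 = ["@r1", "TTACGTAA", "+", "IIIIIIII", "@r2", "CCTTGGAA", "+", "IIIIIIII",
      "@r3", "NNNNNNNN", "+", "IIIIIIII"]
matrix = count_matrix(read_sequences(r1), read_sequences(r2),
                      cell_barcodes=["AAAA", "GGGG"],
                      oligos={"CD4": "ACGT", "CD8": "TTGG"})
print(dict(matrix))  # {('AAAA', 'CD4'): 1, ('GGGG', 'CD8'): 1}
```

Each read pair is matched independently of the others, which is the property that makes this kind of matching straightforward to parallelize.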
Title: CountASAP: a lightweight, easy to use python package for processing ASAPseq data.
Journal: BMC Bioinformatics, p. 307. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12751127/pdf/
Pub Date: 2025-11-27; DOI: 10.1186/s12859-025-06292-0
Francesca Longhin, Giacomo Baruzzo, Enidia Hazizaj, Diego Boscarino, Dino Paladin, Barbara Di Camillo
Title: MOV&RSim: computational modelling of cancer-specific variants and sequencing reads characteristics for realistic tumoral sample simulation.
Journal: BMC Bioinformatics, vol. 26(1), p. 287. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12659228/pdf/
Pub Date: 2025-11-27; DOI: 10.1186/s12859-025-06327-6
Muhammad Ali Nayeem, Shehab Sarar Ahmed, Suliman Aladhadh, M Sohel Rahman
Background: The discovery of DNA motifs is essential for studying gene expression and function in many biological systems. Most existing algorithms for motif detection rely on a single optimization criterion or objective function. This study formulates motif finding as a bi-objective optimization problem and investigates whether multi-objective metaheuristics offer potential advantages over single-objective approaches.
Results: We developed four variants of the Non-dominated Sorting Genetic Algorithm II (NSGA-II) incorporating simple, problem-specific genetic operators. Experiments on six benchmark datasets from three organisms demonstrate that our bi-objective approach significantly outperforms the state-of-the-art Artificial Bee Colony (ABC) metaheuristic. Remarkably, NSGA-II-PMC achieved superior performance over ABC using 6 times fewer fitness evaluations, highlighting its computational efficiency. The synergistic combination of problem-specific operators proved essential, with individual operators showing limited effectiveness compared to their joint application.
Conclusions: Our findings question the common belief that single-objective metaheuristics are better suited for combinatorial problems like motif finding. The bi-objective formulation helps maintain diversity and avoid premature convergence, even with partially correlated objectives, resulting in better solutions than those obtained through dedicated single-objective optimization. Simple, interpretable problem-specific adaptations can yield substantial performance gains over sophisticated alternatives. These results suggest that bi-objective approaches may provide more robust and computationally efficient solutions for DNA motif discovery, opening new research directions in bioinformatics.
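The bi-objective formulation rests on Pareto dominance: a candidate motif is kept in the leading front if no other candidate is at least as good on both objectives and strictly better on one. The sketch below illustrates dominance and a simple front-peeling sort. It is not the authors' NSGA-II variant and omits the bookkeeping of NSGA-II's fast non-dominated sort; both objectives are assumed to be maximized, and the scores are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def non_dominated_sort(points):
    """Peel off successive Pareto fronts; returns lists of indices into `points`."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        # A point belongs to the current front if nothing remaining dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


# Hypothetical (objective_1, objective_2) scores for five candidate motifs.
scores = [(1, 5), (2, 4), (3, 3), (1, 1), (2, 2)]
print(non_dominated_sort(scores))  # [[0, 1, 2], [4], [3]]
```

The first front holds the mutually incomparable trade-offs, which is how a bi-objective search can retain diverse candidates instead of collapsing onto a single score.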
Title: Revisiting motif finding: do bi-objective metaheuristics surpass single-objective metaheuristics?
Journal: BMC Bioinformatics, p. 291. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12713252/pdf/
Pub Date: 2025-11-26; DOI: 10.1186/s12859-025-06294-y
William Wheeler, Difei Wang, Isaac Zhao, Filipa F Vale, Yumi Jin, Charles S Rabkin
Background: Helicobacter pylori is a highly diverse gastric bacterium whose genomic variation both reflects human migration and complicates genome-wide association studies (GWAS). Its 1.67 Mb genome contains ~143,000 biallelic SNPs with minor allele frequency >1%, making population stratification a major confounder. Existing model- and distance-based methods for bacterial ancestry classification often yield inconsistent results depending on dataset composition. A robust and generalizable framework is needed to improve downstream analyses.
Results: We developed GrafGen, an open-source R package adapted from the human ancestry tool GrafPop, for the classification of H. pylori and prophage populations. Using reference data from the H. pylori Genome Project (1,011 genomes from 50 countries), GrafGen identified nine distinct bacterial populations and four prophage groups by genetic distance clustering. Validation with 255 GenBank sequences showed consistent mapping to GrafGen-defined populations. Classifications based on subsets of 14,300 and 1,430 SNPs achieved >97% and >90% concordance, respectively, with those using the full 143,000 SNPs, demonstrating robustness to down-sampling. The package integrates visualization tools for geometric interpretation of ancestry structure and is distributed via Bioconductor (v1.4.0, nine-population reference) and GitHub (v2.0_beta, general framework for haploid species and prophages).
Conclusions: GrafGen provides a reliable approach for classifying H. pylori ancestry and correcting for bacterial population stratification in GWAS. By enabling more accurate inference of genotype-phenotype associations, the method enhances studies of bacterial genetics and host-pathogen interactions. The underlying algorithm is extensible to other haploid organisms with adequate reference data, broadening its relevance beyond H. pylori.
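The down-sampling result above compares classifications made from the full SNP set with those made from a subset and reports how often they agree. The sketch below illustrates that concordance calculation with a deliberately simple stand-in classifier, nearest reference by Hamming distance over haploid genotype strings. This is not GrafGen's algorithm (GrafGen is an R package with its own distance measure), and the population names and genotypes are hypothetical.

```python
def nearest_population(genotype, references, sites):
    """Assign a genotype to the reference with the fewest mismatches over `sites`."""
    def distance(ref):
        return sum(genotype[i] != ref[i] for i in sites)
    return min(references, key=lambda name: distance(references[name]))


def concordance(samples, references, subset_sites):
    """Fraction of samples given the same label by the full and the subset SNP sets."""
    all_sites = range(len(next(iter(references.values()))))
    agree = sum(
        nearest_population(g, references, all_sites)
        == nearest_population(g, references, subset_sites)
        for g in samples
    )
    return agree / len(samples)


# Hypothetical haploid genotypes over 10 biallelic SNPs, coded '0' or '1'.
references = {"popA": "0000000000", "popB": "1111111111"}
samples = ["0000000001", "1111111110", "0001111111", "0101010111"]
every_other_site = range(0, 10, 2)  # keep half of the SNPs
print(concordance(samples, references, every_other_site))  # 0.75
```

In this toy case the last sample changes label once half the sites are dropped, so three of the four classifications agree.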
Title: GrafGen: distance-based inference of population ancestry for Helicobacter pylori genomes.
Journal: BMC Bioinformatics, p. 308. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12751694/pdf/
Pub Date : 2025-11-26 DOI: 10.1186/s12859-025-06286-y
Kumar Thurimella, Ahmed M T Mohamed, Chenhao Li, Tommi Vatanen, Daniel B Graham, Róisín M Owens, Sabina Leanti La Rosa, Damian R Plichta, Sergio Bacallado, Ramnik J Xavier
Background: The functional annotation of uncharacterized microbial enzymes from metagenomic data remains a significant challenge, limiting our understanding of microbial metabolic dynamics. Traditional annotation methods often rely on sequence homology, which can fail to identify remote homologs or enzymes with structural rather than sequence conservation. To address this gap, we developed CAZyLingua, the first annotation tool to use protein language models (pLMs) for the accurate classification of carbohydrate-active enzyme (CAZyme) families and subfamilies.
Results: CAZyLingua demonstrated high performance, maintaining precision and recall comparable to state-of-the-art hidden Markov model-based methods while outperforming purely sequence-based approaches. When applied to a metagenomic gene catalog from mother/infant pairs, CAZyLingua identified over 27,000 putative CAZymes missed by other tools, including horizontally-transferred enzymes implicated in infant microbiome development. In datasets from patients with Crohn's disease and IgG4-related disease, CAZyLingua uncovered disease-associated CAZymes, highlighting an expansion of carbohydrate esterases (CEs) in IgG4-related disease. A CE17 enzyme predicted to be overabundant in Crohn's disease was functionally validated, confirming its catalytic activity on acetylated manno-oligosaccharides.
Conclusions: CAZyLingua is a powerful tool that effectively augments existing functional annotation pipelines for CAZymes. By leveraging the deep contextual information captured by pLMs, our method can uncover novel CAZyme diversity and reveal enzymatic functions relevant to health and disease, contributing to a further understanding of biological processes related to host health and nutrition.
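The precision and recall comparison mentioned in the Results can be made concrete with a short sketch. It assumes each protein carries one true and one predicted CAZyme family label; the function name `per_family_precision_recall` is hypothetical and the code is not taken from CAZyLingua.

```python
from collections import Counter


def per_family_precision_recall(true_labels, predicted_labels):
    """Return {family: (precision, recall)} for single-label predictions."""
    true_pos, predicted, actual = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        actual[truth] += 1
        predicted[guess] += 1
        if truth == guess:
            true_pos[truth] += 1

    scores = {}
    for family in set(actual) | set(predicted):
        # Precision: correct calls among everything predicted as this family.
        precision = true_pos[family] / predicted[family] if predicted[family] else 0.0
        # Recall: correct calls among everything that truly is this family.
        recall = true_pos[family] / actual[family] if actual[family] else 0.0
        scores[family] = (precision, recall)
    return scores
```

Reporting the two numbers per family rather than as a single average keeps rare families visible, which matters when some families have only a handful of members.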
{"title":"Protein language models uncover carbohydrate-active enzyme function in metagenomics.","authors":"Kumar Thurimella, Ahmed M T Mohamed, Chenhao Li, Tommi Vatanen, Daniel B Graham, Róisín M Owens, Sabina Leanti La Rosa, Damian R Plichta, Sergio Bacallado, Ramnik J Xavier","doi":"10.1186/s12859-025-06286-y","DOIUrl":"10.1186/s12859-025-06286-y","url":null,"abstract":"<p><strong>Background: </strong>The functional annotation of uncharacterized microbial enzymes from metagenomic data remains a significant challenge, limiting our understanding of microbial metabolic dynamics. Traditional annotation methods often rely on sequence homology, which can fail to identify remote homologs or enzymes with structural rather than sequence conservation. To address this gap, we developed CAZyLingua, the first annotation tool to use protein language models (pLMs) for the accurate classification of carbohydrate-active enzyme (CAZyme) families and subfamilies.</p><p><strong>Results: </strong>CAZyLingua demonstrated high performance, maintaining precision and recall comparable to state-of-the-art hidden Markov model-based methods while outperforming purely sequence-based approaches. When applied to a metagenomic gene catalog from mother/infant pairs, CAZyLingua identified over 27,000 putative CAZymes missed by other tools, including horizontally-transferred enzymes implicated in infant microbiome development. In datasets from patients with Crohn's disease and IgG4-related disease, CAZyLingua uncovered disease-associated CAZymes, highlighting an expansion of carbohydrate esterases (CEs) in IgG4-related disease. A CE17 enzyme predicted to be overabundant in Crohn's disease was functionally validated, confirming its catalytic activity on acetylated manno-oligosaccharides.</p><p><strong>Conclusions: </strong>CAZyLingua is a powerful tool that effectively augments existing functional annotation pipelines for CAZymes. 
By leveraging the deep contextual information captured by pLMs, our method can uncover novel CAZyme diversity and reveal enzymatic functions relevant to health and disease, contributing to a further understanding of biological processes related to host health and nutrition.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"285"},"PeriodicalIF":3.3,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12659350/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145628910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-25 DOI: 10.1186/s12859-025-06303-0
Niharika Saraf, Vishvesh Karthik, Gaurav Sharma
Background: The AlphaFoldDB Structure Extractor ( https://project.iith.ac.in/sharmaglab/alphafoldextractor/ ) is an open-access web server and API toolkit designed to facilitate the bulk download of predicted protein structures from the AlphaFold Database using well-known accession formats. Addressing the current limitations in extracting structures beyond a restricted list of model organisms and a threshold number, this tool accepts diverse sequence and structure input identifiers, such as NCBI Taxonomy ID, RefSeq accessions, locus tags (old and new), and UniProt or AlphaFold accessions for structure retrieval.
Results: Users can download structure files in PDB, mmCIF, bCIF, and/or PAE JSON formats using any of the above-mentioned accession types as input. The tool also generates an accompanying ID mapping file to trace input identifiers back to standard accession numbers and reports unmapped IDs separately. Users can also perform only the ID mapping if they do not require the structure coordinate files. An API is also provided for programmatic access, enabling integration into bioinformatics pipelines. We tested the tool using several randomly selected accessions (individual inputs and sets of up to 5000 input accessions) of each type from the NCBI RefSeq and Taxonomy databases, the UniProt database, and the AlphaFold Database.
Conclusions: Overall, AlphaFoldDB Structure Extractor streamlines the retrieval of structures from the AlphaFold Database, making it accessible to researchers in structural and functional genomics who have minimal computational expertise.
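The ID mapping step described in the Results (tracing input identifiers to standard accessions and reporting unmapped IDs separately) might look roughly like the sketch below. It does not call the tool's API, whose endpoints are not given here; the `lookup` dictionary, the function name `write_id_mapping`, and the column names are assumptions made for illustration.

```python
import csv
import io


def write_id_mapping(input_ids, lookup):
    """Return (tab-separated mapping text, list of unmapped input IDs).

    `lookup` is a hypothetical in-memory table from an input identifier
    (for example a RefSeq accession or locus tag) to a standard accession.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["input_id", "standard_accession"])

    unmapped = []
    for identifier in input_ids:
        accession = lookup.get(identifier)
        if accession is None:
            # Keep failures separate so they can be reported on their own.
            unmapped.append(identifier)
        else:
            writer.writerow([identifier, accession])
    return buffer.getvalue(), unmapped
```

Returning the unmapped identifiers as a separate list mirrors the separate report described in the abstract and lets a pipeline decide whether to retry them or stop.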
{"title":"AlphaFold Database Structure Extractor: a web server and API to download AlphaFold structures using common protein accessions.","authors":"Niharika Saraf, Vishvesh Karthik, Gaurav Sharma","doi":"10.1186/s12859-025-06303-0","DOIUrl":"10.1186/s12859-025-06303-0","url":null,"abstract":"<p><strong>Background: </strong>The AlphaFoldDB Structure Extractor ( https://project.iith.ac.in/sharmaglab/alphafoldextractor/ ) is an open-access web server and API toolkit designed to facilitate the bulk download of predicted protein structures from the AlphaFold Database using well-known accession formats. Addressing the current limitations in extracting structures beyond a restricted list of model organisms and a threshold number, this tool accepts diverse sequence and structure input identifiers, such as NCBI Taxonomy ID, RefSeq accessions, locus tags (old and new), and UniProt or AlphaFold accessions for structure retrieval.</p><p><strong>Results: </strong>Users can download structure files in PDB, mmCIF, bCIF, or/and PAE JSON formats using any of the above-mentioned input accessions as input. The tool also generates an accompanying ID mapping file to trace input identifiers back to standard accession numbers and reports unmapped IDs separately. Users can also perform just the ID mapping in case they do not require the structure coordinate files. An API methodology is also provided for programmatic access, enabling integration into bioinformatics pipelines. 
We have tested the tool using several randomly selected accessions (individual inputs and up to 5000 input accessions) of each type from NCBI RefSeq and Taxonomy Databases, UniProt Database and AlphaFold Database.</p><p><strong>Conclusions: </strong>Overall, AlphaFoldDB Structure Extractor streamlines the structure procurement process from AlphaFold database, empowering researchers in structural and functional genomics with minimal computational expertise.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":"305"},"PeriodicalIF":3.3,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12751494/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145602065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}