Pub Date: 2025-10-24; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1674791
Shrudhi Devi, Gurunathan Jayaraman
Introduction: Neurodegenerative diseases pose significant challenges owing to the limited number of effective therapies. Nerve growth factor (NGF) plays a crucial role in neuronal survival and differentiation through tropomyosin receptor kinase A (TrkA). Although snake venom NGF (sNGF) has been studied for its ability to activate TrkA, the binding modes and associated dynamics remain unclear compared to those of human NGF (hNGF). Herein, we explored the possibilities of NGFs from Daboia russelii and Naja naja as potential therapeutic alternatives to hNGF by comparing the structural similarities and conserved binding residues.
Methods: The active sites were identified through a literature review, molecular docking was performed using HADDOCK, and molecular dynamics simulation was performed to analyse the stabilities of the complexes; then, PRODIGY and molecular mechanics Poisson-Boltzmann surface area were used to determine the binding affinities.
Results: The sNGFs exhibited stronger binding affinities and stabilities than hNGF, while principal component analysis and the free energy landscape indicated constrained conformational flexibility, suggestive of an adaptive mechanism in sNGF for effective receptor engagement. A network coevolutionary analysis showed the pattern in which amino acids coevolved and remained conserved throughout the simulations.
Discussion: These findings indicate that NGFs from D. russelii and N. naja are promising therapeutic candidates for treating neurodegenerative disorders and warrant further in vivo validation.
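The free energy landscape mentioned in the Results is commonly obtained by Boltzmann inversion of the population of frames projected onto the first two principal components, G = -RT ln(P / P_max). The abstract does not say which tool or settings the authors used, so the following is only a minimal sketch of that general technique; the function name, the bin count, and the 300 K temperature are assumptions made for illustration.

```python
import math
from collections import Counter

R_KJ = 0.0083144626  # molar gas constant in kJ/(mol*K)

def free_energy_landscape(pc1, pc2, bins=10, temperature=300.0):
    """Return {(i, j): G in kJ/mol} for populated bins.

    pc1 and pc2 are per-frame projections on two principal components.
    The most populated bin is the reference and gets G = 0.
    """
    def bin_index(value, lo, hi):
        if hi == lo:
            return 0
        idx = int((value - lo) / (hi - lo) * bins)
        return min(idx, bins - 1)  # the maximum value falls in the last bin

    lo1, hi1 = min(pc1), max(pc1)
    lo2, hi2 = min(pc2), max(pc2)
    counts = Counter(
        (bin_index(x, lo1, hi1), bin_index(y, lo2, hi2))
        for x, y in zip(pc1, pc2)
    )
    max_count = max(counts.values())
    rt = R_KJ * temperature
    # Boltzmann inversion of the relative bin population
    return {cell: -rt * math.log(n / max_count) for cell, n in counts.items()}
```

In such a landscape, a complex with constrained flexibility would typically show a small number of deep, narrow basins rather than many shallow ones.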
Title: Unraveling the molecular basis of snake venom nerve growth factor: human TrkA recognition through molecular dynamics simulation and comparison with human nerve growth factor. Frontiers in Bioinformatics, vol. 5, article 1674791. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12592128/pdf/
Pub Date: 2025-10-23; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1666716
Daiana Colibăşanu, Vlad Groza, Maria Antonietta Occhiuzzi, Fedora Grande, Mihai Udrescu, Lucreția Udrescu
Introduction: Drug repositioning (finding new therapeutic uses for existing drugs) can dramatically reduce development time and cost, but requires efficient computational frameworks to generate and validate repositioning hypotheses. Network-based methods can uncover drug communities with shared pharmacological properties, while molecular docking offers mechanistic insights by predicting drug-target binding.
Methods: We introduce an end-to-end, fully automated pipeline that (1) constructs a tripartite drug-gene-disease network from DrugBank and DisGeNET, (2) projects it into a drug-drug similarity network for community detection, (3) labels communities via Anatomical Therapeutic Chemical (ATC) codes to generate repositioning hints and identify relevant targets, (4) validates hints through automated literature searches, and (5) prioritizes candidates via targeted molecular docking.
Results: After filtering for connectivity and size, 12 robust communities emerged from the initial 34 clusters. The pipeline correctly matched 53.4% of drugs to their ATC level 1 community label via database entries; literature validation confirmed an additional 20.2%, yielding 73.6% overall accuracy. The remaining 26.4% of drugs were flagged as repositioning candidates. As an illustration of the pipeline, molecular docking of chloramphenicol showed stable binding and interaction profiles similar to those of known inhibitors, reinforcing its potential as an anticancer agent.
Conclusion: Our pipeline combines network-based community analysis and automated ATC labeling with literature and docking analysis, narrowing the search space for in silico and experimental follow-up. The chloramphenicol example illustrates its utility for uncovering non-obvious repositioning opportunities. Future work will extend similarity definitions (e.g., to higher-order network motifs) and incorporate wet-lab validation of top candidates.
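Step (2) of the Methods, projecting the drug-gene layer into a drug-drug similarity network, can be illustrated with a small sketch. The abstract does not specify how edge weights are defined, so the version below assumes the simplest choice, weighting each drug pair by the number of genes they share; the function name and the toy identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def project_drug_drug(drug_gene_edges):
    """Project (drug, gene) pairs onto drugs.

    Returns {(drug_a, drug_b): shared_gene_count} with drug_a < drug_b.
    Assumption: similarity is the count of shared genes.
    """
    gene_to_drugs = defaultdict(set)
    for drug, gene in drug_gene_edges:
        gene_to_drugs[gene].add(drug)

    weights = defaultdict(int)
    for drugs in gene_to_drugs.values():
        # every pair of drugs touching the same gene gains one unit of weight
        for a, b in combinations(sorted(drugs), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Toy input with made-up identifiers
edges = [
    ("drugA", "gene1"), ("drugB", "gene1"),
    ("drugA", "gene2"), ("drugB", "gene2"),
    ("drugC", "gene3"),
]
network = project_drug_drug(edges)
```

Community detection would then run on the weighted graph returned here; drugs with no shared genes, such as drugC in the toy input, stay unconnected.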
Title: Drug repositioning pipeline integrating community analysis in drug-drug similarity networks and automated ATC community labeling to foster molecular docking analysis. Frontiers in Bioinformatics, vol. 5, article 1666716. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12589059/pdf/
Pub Date: 2025-10-22; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1651623
Islam Akef Ebeid, Haoteng Tang, Pengfei Gu
Introduction: Accurate prediction of protein-protein interactions (PPIs) is crucial for understanding cellular functions and advancing drug development. While existing in silico methods leverage direct sequence embeddings from Protein Language Models (PLMs) or apply Graph Neural Networks (GNNs) to 3D protein structures, the main focus of this study is to investigate less computationally intensive alternatives. This work introduces a novel framework for the downstream task of PPI prediction via link prediction.
Methods: We introduce a two-stage graph representation learning framework, ProtGram-DirectGCN. First, we developed ProtGram, a novel approach that models a protein's primary structure as a hierarchy of globally inferred n-gram graphs. In these graphs, residue transition probabilities, aggregated from a large sequence corpus, define the edge weights of a directed graph of paired residues. Second, we propose a custom directed graph convolutional neural network, DirectGCN, which features a unique convolutional layer that processes information through separate path-specific (incoming, outgoing, undirected) and shared transformations, combined via a learnable gating mechanism. DirectGCN is applied to the ProtGram graphs to learn residue-level embeddings, which are then pooled via an attention mechanism to generate protein-level embeddings for the prediction task.
Results: The efficacy of the DirectGCN model was first established on standard node classification benchmarks, where its performance is comparable to that of established methods on general datasets, while demonstrating specialization for complex, directed, and dense heterophilic graph structures. When applied to PPI prediction, the full ProtGram-DirectGCN framework achieves robust predictive power despite being trained on limited data.
Discussion: Our results suggest that a globally inferred, directed graph-based representation of sequence transitions offers a potent and computationally distinct alternative to resource-intensive PLMs for the task of PPI prediction. Future work will involve testing ProtGram-DirectGCN on a wider range of bioinformatics tasks.
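The ProtGram idea in the Methods, a directed graph whose edge weights are residue transition probabilities aggregated over a sequence corpus, can be sketched as follows. This is not the authors' implementation: the overlapping n-gram windows and the row-wise normalization (each source node's outgoing weights sum to 1) are assumptions, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def ngram_transition_graph(sequences, n=1):
    """Return {source_ngram: {target_ngram: probability}} over a corpus.

    An edge src -> dst is counted each time dst is the n-gram that starts
    one position after src in any sequence.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        grams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
        for src, dst in zip(grams, grams[1:]):
            counts[src][dst] += 1

    graph = {}
    for src, targets in counts.items():
        total = sum(targets.values())
        # normalize outgoing counts into transition probabilities
        graph[src] = {dst: c / total for dst, c in targets.items()}
    return graph

# Toy corpus of two short sequences
graph = ngram_transition_graph(["MKMA", "MK"], n=1)
```

Because the graph is built from the whole corpus rather than from one protein, it is computed once and shared; calling the function with n = 2, 3, and so on would give the higher levels of the hierarchy described above.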
Title: Inferred global dense residue transition graphs from primary structure sequences enable protein interaction prediction via directed graph convolutional neural networks. Frontiers in Bioinformatics, vol. 5, article 1651623. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12585958/pdf/
Pub Date: 2025-10-21; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1649337
Qihuan Yao, Zhen Chen, Ye Cao, Huijing Hu
Introduction: Accurately predicting drug-target interactions (DTIs) is crucial for accelerating drug discovery and repurposing. Despite recent advances in deep learning-based methods, challenges remain in effectively capturing the complex relationships between drugs and targets while incorporating prior biological knowledge.
Methods: We introduce a novel framework that combines graph neural networks with knowledge integration for DTI prediction. Our approach learns representations from molecular structures and protein sequences through a customized graph-based message passing scheme. We integrate domain knowledge from biomedical ontologies and databases using a knowledge-based regularization strategy to infuse biological context into the learned representations.
Results: We evaluated our model on multiple benchmark datasets, achieving an average AUC of 0.98 and an average AUPR of 0.89, surpassing existing state-of-the-art methods by a considerable margin. Visualization of learned attention weights identified salient molecular substructures and protein motifs driving the predicted interactions, demonstrating model interpretability.
Discussion: We validated the practical utility by predicting novel DTIs for FDA-approved drugs and experimentally confirming a high proportion of predictions. Our framework offers a powerful and interpretable solution for DTI prediction with the potential to substantially accelerate the identification of new drug candidates and therapeutic targets.
Title: Enhancing drug-target interaction prediction with graph representation learning and knowledge-based regularization. Frontiers in Bioinformatics, vol. 5, article 1649337. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12583218/pdf/
Pub Date: 2025-10-20; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1669237
Vachiranee Limviphuvadh, Thimo Ruethers, Minh N Nguyen, Dean R Jerry, Benjamin P C Smith, Yulan Wang, Yansong Miao, Anand Kumar Andiappan, Andreas L Lopata, Sebastian Maurer-Stroh
Introduction: Fish is a major food allergy trigger with a complex variety of allergenic protein isoforms and vast species diversity exhibiting variable allergenicity. This is the first study to systematically compile fish isoallergen and variant entries associated with ingestion-related allergic reactions.
Methods: Entries were compiled from four major allergen databases: World Health Organization and International Union of Immunological Societies (WHO/IUIS), AllergenOnline, Comprehensive Protein Allergen Resource (COMPARE), and Allergome, including evidence from in vitro IgE-binding assays and complete amino acid sequences. Challenges in predicting the allergenicity of fish isoallergens and variants were evaluated, and the sensitivity of five widely used in silico tools (AllerCatPro 2.0, AlgPred 2.0, pLM4Alg, AllergenFP v.1.0, and AllerTop v.2.0) was assessed. Epitope mapping and phylogenetic analyses were performed for the major fish allergen parvalbumin, incorporating experimentally validated B-cell epitope data from the Immune Epitope Database (IEDB) and evolutionary relationships.
Results: A comprehensive dataset of 79 unique fish isoallergen and variant entries from 34 fish species was identified, with 25 entries common across all four databases. AllerCatPro 2.0 achieved the highest sensitivity (97.5%). A phylogenetic tree was constructed, integrating epitope data to optimize protein family-specific thresholds for differentiating allergenic from less/non-allergenic parvalbumins. A threshold of ≥4 IEDB-mapped epitopes allowing up to two mismatches captured 52 out of 54 parvalbumin sequences (96%) in the dataset, effectively distinguishing between parvalbumin classes.
Discussion: This study enhances understanding of fish allergy by systematically compiling fish isoallergens and variants and integrating B-cell epitope data. The optimized thresholds improve the performance of allergenicity prediction tools and can be applied to other protein families in future studies.
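The threshold rule reported in the Results (at least 4 mapped epitopes, each allowing up to two mismatches) can be expressed as a short check. The abstract does not describe the matching procedure, so this sketch assumes ungapped sliding-window comparison in which mismatches are substitutions only; the function names and the toy sequence and epitopes are hypothetical, not IEDB data.

```python
def epitope_matches(sequence, epitope, max_mismatches=2):
    """True if some ungapped window of `sequence` differs from `epitope`
    at no more than `max_mismatches` positions."""
    k = len(epitope)
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        mismatches = sum(a != b for a, b in zip(window, epitope))
        if mismatches <= max_mismatches:
            return True
    return False

def passes_epitope_threshold(sequence, epitopes, min_epitopes=4, max_mismatches=2):
    """Apply the 'at least min_epitopes mapped epitopes' rule to one sequence."""
    hits = sum(epitope_matches(sequence, e, max_mismatches) for e in epitopes)
    return hits >= min_epitopes

# Toy example: made-up sequence and epitopes
toy_sequence = "ACDEFGHIKLMNPQRSTVWY"
toy_epitopes = ["ACDEF", "GHIKA", "MNPAA", "STVWY", "WWWWW"]
flagged = passes_epitope_threshold(toy_sequence, toy_epitopes)
```

In the toy input, four of the five epitopes map within the mismatch budget, so the sequence is flagged; tightening max_mismatches to 0 leaves only two exact matches and the sequence no longer passes.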
Title: Fish isoallergens and variants: database compilation, in silico allergenicity prediction challenges, and epitope-based threshold optimization. Frontiers in Bioinformatics, vol. 5, article 1669237. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12580176/pdf/
Pub Date: 2025-10-16; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1622931
Shreyashi Bodaka, Narasaiah Kolliputi
Phage therapy has reemerged as a compelling alternative to antibiotics in treating bacterial infections, especially for superbugs that have developed antibiotic resistance. The challenge in the broader application of phage therapy is identifying host targets for the vast array of uncharacterized phages obtained through next-generation sequencing. We introduce a Composite Model for Phage Host Interaction (CoMPHI) that integrates alignment-based approaches with machine learning. The model generates multiple feature encodings from nucleotide and protein sequences of both phages and hosts. It incorporates alignment scores between phage-phage, phage-host, and host-host pairs, creating a composite prediction framework. During 5-fold cross-validation, CoMPHI achieved Area Under the ROC Curve (AUC-ROC) values of 94-96.7% and accuracies of 92.3-95.1% across taxonomic levels from species to phylum. Comparative analysis showed a 6-8% performance improvement when alignment scores were included. Ablation studies demonstrated that combining nucleotide and protein encodings, along with phage-host, host-host, and phage-phage alignment scores, significantly enhanced prediction accuracy. CoMPHI provides a robust and comprehensive framework for predicting phage-host interactions. By combining sequence features and alignment information, the model advances computational tools that can accelerate the application of phage therapy in modern medicine.
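To make the idea of a composite feature vector concrete, the sketch below concatenates sequence encodings of a phage and a host with a set of alignment scores. The abstract does not name the encodings CoMPHI uses, so k-mer frequencies stand in here as one common choice, and the alignment scores are taken as precomputed numbers; all function names and the example values are hypothetical.

```python
from itertools import product

def kmer_frequencies(sequence, k=2, alphabet="ACGT"):
    """Frequency of every k-mer over `alphabet`, in lexicographic product order.

    Windows containing characters outside the alphabet are skipped.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]

def composite_features(phage_seq, host_seq, alignment_scores, k=2):
    """Concatenate phage encoding, host encoding, and alignment scores.

    `alignment_scores` is assumed to be a precomputed sequence of numbers,
    e.g. phage-phage, phage-host, and host-host scores.
    """
    return (
        kmer_frequencies(phage_seq, k)
        + kmer_frequencies(host_seq, k)
        + list(alignment_scores)
    )

# Toy call with made-up sequences and scores
features = composite_features("ACGTACGT", "GGCCGGCC", [0.8, 0.1, 0.5])
```

A vector like this would be the input to whatever classifier is trained downstream; dropping the last three entries corresponds to the "without alignment scores" setting that the comparative analysis describes.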
Title: CoMPHI: a novel composite machine learning approach utilizing multiple feature representation to predict hosts of bacteriophages. Frontiers in Bioinformatics, vol. 5, article 1622931. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12571911/pdf/
Pub Date: 2025-10-15; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1687687
Chunhui Xu, Yang Yu, Govardhan Khadakkar, Jiacheng Xie, Dong Xu, Qiuming Yao
Biological databases are essential for providing curated knowledge, but their rigid data structures and restrictive query formats often limit flexible and exploratory user interactions. In the field of plant phosphorylation, manually curated and reviewed data represent only a small portion of the available knowledge, and users often seek information that goes beyond what is provided in structured databases. While large language models (LLMs) like ChatGPT-4o possess extensive contextual knowledge, integrating this capability into bioinformatics tools remains an open challenge. Here, we present a multimodal question-answering widget that integrates ChatGPT-4o with our Plant Protein Phosphorylation Database (P3DB). This system supports natural language queries and dynamic prompt formulation, enabling users to explore phosphorylation events, kinase-substrate relationships, and protein-protein interactions through a global entry. In another application, the widget leverages ChatGPT's image interpretation functionality to extract regulatory pathways and phosphorylation markers from complex scientific figures. To build this widget effectively, we have explored multiple prompt strategies, including one-step, two-step, few-shot, and image-cropping techniques, demonstrating their impact on output accuracy and consistency. In addition, recent multimodal LLMs such as ChatGPT-5 and Gemini 1.5 have demonstrated comparable capabilities and adaptability when applied to our test cases and the developed widgets. Together, our application widget and results highlight the development of the ChatGPT-P3DB integration as a system that enhances user accessibility, enables visual extraction, and extends the current utility of biological knowledgebases through a flexible and adaptive framework. Our "ChatGPT-P3DB" is open-source and can be accessed on GitHub (https://github.com/yao-laboratory/p3db-chat). 
The frontend interface, "P3DB askAI" web module, can be accessed freely through https://www.p3db.org/ask-ai.
Title: Multimodal knowledge expansion widget powered by plant protein phosphorylation database and ChatGPT. Frontiers in Bioinformatics, vol. 5, article 1687687. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12568720/pdf/
Pub Date: 2025-10-10; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1609004
Rafhanah Shazwani Rosli, Mohamed Hadi Habaebi, Md Rafiqul Islam, Mohammed Abdulla Salim Al Hussaini
Introduction: Breast cancer detection using thermal imaging relies on accurate segmentation of the breast region from adjacent body areas. Reliable segmentation is essential to improve the effectiveness of computer-aided diagnosis systems.
Methods: This study evaluated three segmentation models (U-Net, U-Net with Spatial Attention, and U-Net++) using five optimization algorithms (ADAM, NADAM, RMSPROP, SGDM, and ADADELTA). Performance was assessed through k-fold cross-validation with metrics including Intersection over Union (IoU), Dice coefficient, precision, recall, sensitivity, specificity, pixel accuracy, ROC-AUC, and PR-AUC, alongside Grad-CAM heatmaps for qualitative analysis.
Results: The ADAM optimizer consistently outperformed the others, yielding superior accuracy and reduced loss. Among the models, the baseline U-Net, despite being less complex, demonstrated the most effective performance, with precision of 0.9721, recall of 0.9559, specificity of 0.9801, ROC-AUC of 0.9680, and PR-AUC of 0.9472. U-Net also achieved higher robustness in breast region overlap and noise handling compared to its more complex variants. The findings indicate that greater architectural complexity does not necessarily lead to improved outcomes.
Discussion: This research highlights that the original U-Net, when trained with the ADAM optimizer, remains highly effective for breast region segmentation in thermal images. The insights contribute to guiding the selection of suitable deep learning models and optimizers for medical image analysis, with the potential to enhance the efficiency and accuracy of breast cancer diagnosis using thermal imaging.
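The overlap metrics listed in the Methods can be made concrete with a small sketch. It assumes binary masks flattened to lists of 0/1 pixels and uses the standard confusion-matrix definitions; the function name `segmentation_metrics` and the toy masks are illustrative and are not from the study.

```python
from typing import Dict, Sequence


def segmentation_metrics(pred: Sequence[int], truth: Sequence[int]) -> Dict[str, float]:
    """Compute pixel-wise metrics for two binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")

    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)

    def ratio(numerator: int, denominator: int) -> float:
        # Return 0.0 for an empty denominator instead of raising.
        return numerator / denominator if denominator else 0.0

    return {
        "iou": ratio(tp, tp + fp + fn),           # intersection over union
        "dice": ratio(2 * tp, 2 * tp + fp + fn),  # Dice coefficient
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),             # identical to sensitivity
        "specificity": ratio(tn, tn + fp),
        "pixel_accuracy": ratio(tp + tn, len(pred)),
    }


# Toy 8-pixel masks (illustrative only).
pred = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 0, 0, 1, 1]
metrics = segmentation_metrics(pred, truth)
print(metrics)
```

Recall and sensitivity are the same quantity, TP / (TP + FN), so the sketch reports it once. Dice and IoU are related by Dice = 2·IoU / (1 + IoU), which is why the two always rank models in the same order.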
Title: Analysis of breast region segmentation in thermal images using U-Net deep neural network variants. Journal: Frontiers in bioinformatics, volume 5, 1609004. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12550958/pdf/
Pub Date: 2025-10-08; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1629526
Ying Liu, Jiayi Cai, Aamir Fahira, Kai Zhuang, Jiaojiao Wang, Zhi Zhang, Lin Yan, Yong Liu, Defang Ouyang, Zunnan Huang
Objective: Triple-negative breast cancer (TNBC), a classic subtype of breast cancer, is challenging to treat due to the lack of targetable receptors. This study aims to explore interferon-related prognostic molecular biomarkers in TNBC and their potential competing endogenous RNA (ceRNA) regulatory network.
Methods: RNA expression profiles and interferon genes were downloaded from The Cancer Genome Atlas (TCGA) database and the Gene Set Enrichment Analysis (GSEA) website, respectively. Univariate and multivariate Cox regression analyses were performed to identify prognostic genes and construct a risk model. Single-sample GSEA (ssGSEA) and the CellMiner database were used to explore the relationships between the prognostic genes and the tumor immune microenvironment and drug sensitivity, respectively. The prognosis-associated lncRNA-miRNA-mRNA network was constructed using the ENCORI database. Finally, potential interferon-associated lncRNA/miRNA/mRNA regulatory axes were identified through correlation analysis. The abnormal expression of the prognostic genes was validated in three TNBC cell lines relative to normal mammary epithelial cells using quantitative real-time polymerase chain reaction (qRT-PCR).
Results: A TNBC prognostic signature comprising four interferon genes (STXBP1, LAMP3, CD276, and POLR2F) was identified; their expression correlated significantly with the infiltration abundance of multiple immune cell types and with sensitivity to 30 drugs (including ARQ-680, fluphenazine, and chelerythrine). An interferon-related prognostic ceRNA network was then constructed, consisting of 248 lncRNAs, 66 miRNAs, and 4 mRNAs. From this network, five interferon-related ceRNA regulatory axes associated with TNBC progression were identified (AC124067.4/hsa-miR-455-3p/STXBP1, RBPMS-AS1/hsa-miR-455-3p/STXBP1, DNMBP-AS1/hsa-miR-455-3p/STXBP1, FAM198B-AS1/hsa-miR-455-3p/STXBP1, LIFR-AS1/hsa-miR-455-3p/STXBP1). qRT-PCR results showed that all four prognostic mRNAs were upregulated in TNBC cells.
Conclusion: This study established an interferon-associated prognostic signature and ceRNA network in TNBC and identified five key regulatory axes. Within the signature and the ceRNA axes, STXBP1, RBPMS-AS1, and FAM198B-AS1 are reported here for the first time as potential biomarkers of TNBC. These findings may provide new insights into the mechanisms driving TNBC tumorigenesis and development.
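A risk model built from multivariate Cox regression is commonly applied as a linear score, risk = Σ βᵢ · exprᵢ, followed by a cutoff that splits patients into high- and low-risk groups. The sketch below assumes that form and a median cutoff; the abstract reports neither the coefficients nor the cutoff, so the values in `COEFS` and the sample expression levels are made up purely for illustration.

```python
from statistics import median
from typing import Dict

# HYPOTHETICAL coefficients: the abstract does not report the fitted values.
COEFS: Dict[str, float] = {
    "STXBP1": 0.40,
    "LAMP3": -0.25,
    "CD276": 0.30,
    "POLR2F": 0.15,
}


def risk_score(expression: Dict[str, float], coefs: Dict[str, float]) -> float:
    """Linear predictor of a Cox model: sum of coefficient times expression."""
    return sum(coefs[gene] * expression[gene] for gene in coefs)


def stratify(samples: Dict[str, Dict[str, float]], coefs: Dict[str, float]) -> Dict[str, str]:
    """Label each sample 'high' or 'low' risk relative to the median score (assumed cutoff)."""
    scores = {sample_id: risk_score(expr, coefs) for sample_id, expr in samples.items()}
    cutoff = median(scores.values())
    return {sample_id: ("high" if score > cutoff else "low") for sample_id, score in scores.items()}


# Made-up expression values for three samples.
SAMPLES = {
    "s1": {"STXBP1": 2.0, "LAMP3": 1.0, "CD276": 1.0, "POLR2F": 0.0},
    "s2": {"STXBP1": 1.0, "LAMP3": 2.0, "CD276": 0.0, "POLR2F": 1.0},
    "s3": {"STXBP1": 0.0, "LAMP3": 0.0, "CD276": 1.0, "POLR2F": 2.0},
}

groups = stratify(SAMPLES, COEFS)
print(groups)
```

In practice the coefficients would come from a fitted survival model, and the two groups would then be compared with Kaplan-Meier curves or a log-rank test.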
Title: Unveiling the impact of interferon genes on the immune microenvironment of triple-negative breast cancer: identification of therapeutic targets. Journal: Frontiers in bioinformatics, volume 5, 1629526. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12542738/pdf/
T-cell receptor (TCR) sequencing has emerged as a powerful tool for understanding adaptive immune responses, yet challenges persist in deciphering the immense diversity of Complementarity-Determining Region 3 (CDR3) sequences. This study presents a novel natural language processing (NLP)-based pipeline to cluster CDR3 sequences from TCR β-chain repertoires using Word2Vec embeddings, principal component analysis (PCA), and KMeans clustering. Focusing on Acute Respiratory Distress Syndrome (ARDS), a life-threatening inflammatory lung condition, we trained Word2Vec models on healthy controls and applied unsupervised clustering across ARDS, non-ARDS, and control datasets. Dimensionality-reduced embeddings revealed clear distinctions in repertoire structure: control samples exhibited tight, low-diversity clusters; ARDS patients showed high dispersion and numerous diffuse clusters indicative of repertoire disruption; and non-ARDS samples displayed intermediate organization. These differences suggest that immune activation states are embedded in the structural topology of the CDR3 space. Our framework successfully captured these latent patterns, offering a scalable approach to biomarker discovery. This study not only reinforces the utility of NLP in immunological analysis but also paves the way for data-driven immune monitoring in critical care and personalized diagnostics.
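Before Word2Vec can be trained on CDR3 sequences, each sequence has to be turned into a list of "words". The abstract does not say how this is done; a common choice, assumed here, is overlapping k-mers of amino acids. The function name `kmer_tokens`, the value k = 3, and the example strings are illustrative and are not taken from the study's data.

```python
from collections import Counter
from typing import List


def kmer_tokens(cdr3: str, k: int = 3) -> List[str]:
    """Split an amino-acid sequence into overlapping k-mers (stride 1)."""
    sequence = cdr3.strip().upper()
    if not sequence:
        return []
    if len(sequence) < k:
        # A sequence shorter than k is kept as one token.
        return [sequence]
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Illustrative strings only, not sequences from the ARDS cohort.
sequences = ["CASSLG", "CASSPG"]
corpus = [kmer_tokens(sequence) for sequence in sequences]

# Token frequencies across the corpus, i.e. the vocabulary an embedding model would see.
vocabulary = Counter(token for sentence in corpus for token in sentence)
print(corpus)
print(vocabulary.most_common(2))
```

Each token list plays the role of a sentence for the embedding model. One plausible continuation, again an assumption, is to average the token vectors per sequence, reduce them with PCA, and cluster the result with KMeans.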
Title: Optimizing clustering of CDR3 sequences using natural language processing, Word2Vec, and KMeans. Authors: Sanskriti Baranwal, Ricardo Avila Sanchez, Clement-Andi Edet, Erick Chastain, Inimary Toby. Journal: Frontiers in bioinformatics, volume 5, 1623488. Publication date: 2025-10-02. DOI: 10.3389/fbinf.2025.1623488. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12528129/pdf/