Pub Date: 2025-10-16; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1622931
Shreyashi Bodaka, Narasaiah Kolliputi
Phage therapy has reemerged as a compelling alternative to antibiotics in treating bacterial infections, especially for superbugs that have developed antibiotic resistance. The challenge in the broader application of phage therapy is identifying host targets for the vast array of uncharacterized phages obtained through next-generation sequencing. We introduce a Composite Model for Phage Host Interaction (CoMPHI) that integrates alignment-based approaches with machine learning. The model generates multiple feature encodings from nucleotide and protein sequences of both phages and hosts. It incorporates alignment scores between phage-phage, phage-host, and host-host pairs, creating a composite prediction framework. During 5-fold cross-validation, CoMPHI achieved Area Under the ROC Curve (AUC-ROC) values of 94-96.7% and accuracies of 92.3-95.1% across taxonomic levels from species to phylum. Comparative analysis showed a 6-8% performance improvement when alignment scores were included. Ablation studies demonstrated that combining nucleotide and protein encodings, along with phage-host, host-host, and phage-phage alignment scores, significantly enhanced prediction accuracy. CoMPHI provides a robust and comprehensive framework for predicting phage-host interactions. By combining sequence features and alignment information, the model advances computational tools that can accelerate the application of phage therapy in modern medicine.
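The idea of combining sequence encodings with alignment scores can be illustrated with a minimal sketch. It assumes normalized k-mer frequencies as the sequence encoding and three precomputed alignment scores (standing in for phage-host, phage-phage, and host-host comparisons); the abstract does not specify CoMPHI's actual encodings, so the function names, the choice of k, and all values below are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only k-mers built from the alphabet are counted; others (e.g. with N) are ignored.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


def composite_features(phage_seq, host_seq, alignment_scores, k=2):
    """Concatenate phage encoding, host encoding, and precomputed alignment scores."""
    return (
        kmer_frequencies(phage_seq, k)
        + kmer_frequencies(host_seq, k)
        + list(alignment_scores)
    )


# Toy sequences and made-up alignment scores, for illustration only.
features = composite_features("ACGTACGT", "TTGACA", alignment_scores=[0.82, 0.10, 0.35])
print(len(features))  # 16 phage 2-mers + 16 host 2-mers + 3 scores
```

A vector built this way could be passed to any standard classifier; the point is only that alignment-derived scores sit alongside sequence-derived features in one input.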
Title: CoMPHI: a novel composite machine learning approach utilizing multiple feature representation to predict hosts of bacteriophages.
Journal: Frontiers in Bioinformatics, vol. 5, article 1622931.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12571911/pdf/
Pub Date: 2025-10-15; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1687687
Chunhui Xu, Yang Yu, Govardhan Khadakkar, Jiacheng Xie, Dong Xu, Qiuming Yao
Biological databases are essential for providing curated knowledge, but their rigid data structures and restrictive query formats often limit flexible and exploratory user interactions. In the field of plant phosphorylation, manually curated and reviewed data represent only a small portion of the available knowledge, and users often seek information that goes beyond what is provided in structured databases. While large language models (LLMs) like ChatGPT-4o possess extensive contextual knowledge, integrating this capability into bioinformatics tools remains an open challenge. Here, we present a multimodal question-answering widget that integrates ChatGPT-4o with our Plant Protein Phosphorylation Database (P3DB). This system supports natural language queries and dynamic prompt formulation, enabling users to explore phosphorylation events, kinase-substrate relationships, and protein-protein interactions through a global entry. In another application, the widget leverages ChatGPT's image interpretation functionality to extract regulatory pathways and phosphorylation markers from complex scientific figures. To build this widget effectively, we have explored multiple prompt strategies, including one-step, two-step, few-shot, and image-cropping techniques, demonstrating their impact on output accuracy and consistency. In addition, recent multimodal LLMs such as ChatGPT-5 and Gemini 1.5 have demonstrated comparable capabilities and adaptability when applied to our test cases and the developed widgets. Together, our application widget and results highlight the development of the ChatGPT-P3DB integration as a system that enhances user accessibility, enables visual extraction, and extends the current utility of biological knowledgebases through a flexible and adaptive framework. Our "ChatGPT-P3DB" is open-source and can be accessed on GitHub (https://github.com/yao-laboratory/p3db-chat). 
The frontend interface, "P3DB askAI" web module, can be accessed freely through https://www.p3db.org/ask-ai.
Title: Multimodal knowledge expansion widget powered by plant protein phosphorylation database and ChatGPT.
Journal: Frontiers in Bioinformatics, vol. 5, article 1687687.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12568720/pdf/
Pub Date: 2025-10-10; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1609004
Rafhanah Shazwani Rosli, Mohamed Hadi Habaebi, Md Rafiqul Islam, Mohammed Abdulla Salim Al Hussaini
Introduction: Breast cancer detection using thermal imaging relies on accurate segmentation of the breast region from adjacent body areas. Reliable segmentation is essential to improve the effectiveness of computer-aided diagnosis systems.
Methods: This study evaluated three segmentation models (U-Net, U-Net with Spatial Attention, and U-Net++) using five optimization algorithms (ADAM, NADAM, RMSPROP, SGDM, and ADADELTA). Performance was assessed through k-fold cross-validation with metrics including Intersection over Union (IoU), Dice coefficient, precision, recall, sensitivity, specificity, pixel accuracy, ROC-AUC, and PR-AUC, with Grad-CAM heatmaps used for qualitative analysis.
Results: The ADAM optimizer consistently outperformed the others, yielding superior accuracy and reduced loss. Among the models, the baseline U-Net, despite being less complex, demonstrated the most effective performance, with precision of 0.9721, recall of 0.9559, specificity of 0.9801, ROC-AUC of 0.9680, and PR-AUC of 0.9472. U-Net also achieved higher robustness in breast region overlap and noise handling compared to its more complex variants. The findings indicate that greater architectural complexity does not necessarily lead to improved outcomes.
Discussion: This research highlights that the original U-Net, when trained with the ADAM optimizer, remains highly effective for breast region segmentation in thermal images. The insights contribute to guiding the selection of suitable deep learning models and optimizers for medical image analysis, with the potential to enhance the efficiency and accuracy of breast cancer diagnosis using thermal imaging.
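Two of the overlap metrics named in the Methods, IoU and the Dice coefficient, can be computed directly from a predicted and a ground-truth binary mask. The sketch below uses their standard definitions on flat lists of 0/1 pixels; the masks are toy values, and treating two empty masks as a perfect match is an assumed convention, not something stated in the abstract.

```python
def iou_and_dice(pred, truth):
    """Compute IoU and Dice for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_area = sum(1 for p in pred if p)
    truth_area = sum(1 for t in truth if t)
    union = pred_area + truth_area - intersection
    # Assumed convention: two empty masks count as a perfect match.
    iou = intersection / union if union else 1.0
    dice = 2 * intersection / (pred_area + truth_area) if (pred_area + truth_area) else 1.0
    return iou, dice


# Toy masks, for illustration only.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]
iou, dice = iou_and_dice(pred, truth)
print(iou, round(dice, 4))
```

The two metrics are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank models identically on a single image even though their numeric values differ.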
Title: Analysis of breast region segmentation in thermal images using U-Net deep neural network variants.
Journal: Frontiers in Bioinformatics, vol. 5, article 1609004.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12550958/pdf/
Pub Date: 2025-10-08; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1629526
Ying Liu, Jiayi Cai, Aamir Fahira, Kai Zhuang, Jiaojiao Wang, Zhi Zhang, Lin Yan, Yong Liu, Defang Ouyang, Zunnan Huang
Objective: Triple-negative breast cancer (TNBC), a classic subtype of breast cancer, is challenging to treat due to the lack of drug-targeting receptors. This study aims to explore interferon-related prognostic molecular biomarkers in TNBC and their potential competing endogenous RNA (ceRNA) regulatory network.
Methods: RNA expression profiles and interferon genes were downloaded from The Cancer Genome Atlas (TCGA) database and the Gene Set Enrichment Analysis (GSEA) website, respectively. Univariate and multivariate Cox regression analyses were performed to identify prognostic genes and construct a risk model. Single-sample GSEA (ssGSEA) and the CellMiner database were used to explore the relationships between the prognostic genes and the tumor immune microenvironment and drug sensitivity, respectively. The prognosis-associated lncRNA-miRNA-mRNA network was constructed using the ENCORI database. Finally, potential interferon-associated lncRNA/miRNA/mRNA regulatory axes were identified through correlation analysis. The abnormal expression of the prognostic genes was validated by quantitative real-time polymerase chain reaction (qRT-PCR) in three TNBC tumor cell lines compared with normal mammary epithelial cells.
Results: A TNBC prognostic signature comprising four interferon genes (STXBP1, LAMP3, CD276, and POLR2F) was identified; their expression correlated significantly with the infiltration abundance of multiple immune cells and with sensitivity to 30 drugs, including ARQ-680, fluphenazine, and chelerythrine. An interferon-related prognostic ceRNA network was then constructed, consisting of 248 lncRNAs, 66 miRNAs, and 4 mRNAs. From this network, five interferon-related ceRNA regulatory axes associated with TNBC progression were identified (AC124067.4/hsa-miR-455-3p/STXBP1, RBPMS-AS1/hsa-miR-455-3p/STXBP1, DNMBP-AS1/hsa-miR-455-3p/STXBP1, FAM198B-AS1/hsa-miR-455-3p/STXBP1, and LIFR-AS1/hsa-miR-455-3p/STXBP1). qRT-PCR results showed that all four prognostic mRNAs were upregulated in TNBC cells.
Conclusion: This study established a prognostic signature and a ceRNA network associated with interferon in TNBC, and identified five key regulatory axes. Within the prognostic signature and the ceRNA axes, STXBP1, RBPMS-AS1, and FAM198B-AS1 are reported for the first time as potential biomarkers of TNBC. These findings have the potential to provide new insights into the mechanisms driving TNBC tumorigenesis and development.
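A risk model built from multivariate Cox regression is commonly applied as a linear score: each gene's expression is weighted by its fitted coefficient and the products are summed. The sketch below shows that calculation for the four signature genes. The coefficients and expression values are placeholders invented for illustration (the abstract reports neither), and the median split into high- and low-risk groups is a common convention assumed here, not a detail confirmed by the source.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Linear predictor of a Cox model: sum of coefficient * expression per gene."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


# Placeholder coefficients: NOT taken from the paper, chosen only for illustration.
coefficients = {"STXBP1": 0.4, "LAMP3": -0.3, "CD276": 0.2, "POLR2F": 0.1}

# Made-up expression values for three hypothetical samples.
samples = {
    "s1": {"STXBP1": 2.0, "LAMP3": 1.0, "CD276": 3.0, "POLR2F": 1.0},
    "s2": {"STXBP1": 0.5, "LAMP3": 4.0, "CD276": 1.0, "POLR2F": 2.0},
    "s3": {"STXBP1": 1.0, "LAMP3": 2.0, "CD276": 2.0, "POLR2F": 0.0},
}

scores = {name: risk_score(expr, coefficients) for name, expr in samples.items()}
cutoff = median(scores.values())  # assumed convention: split at the median score
groups = {name: "high" if score > cutoff else "low" for name, score in scores.items()}
print(groups)
```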
Title: Unveiling the impact of interferon genes on the immune microenvironment of triple-negative breast cancer: identification of therapeutic targets.
Journal: Frontiers in Bioinformatics, vol. 5, article 1629526.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12542738/pdf/
T-cell receptor (TCR) sequencing has emerged as a powerful tool for understanding adaptive immune responses, yet challenges persist in deciphering the immense diversity of Complementarity-Determining Region 3 (CDR3) sequences. This study presents a novel natural language processing (NLP)-based pipeline to cluster CDR3 sequences from TCR β-chain repertoires using Word2Vec embeddings, principal component analysis (PCA), and KMeans clustering. Focusing on Acute Respiratory Distress Syndrome (ARDS), a life-threatening inflammatory lung condition, we trained Word2Vec models on healthy controls and applied unsupervised clustering across ARDS, non-ARDS, and control datasets. Dimensionality-reduced embeddings revealed clear distinctions in repertoire structure: control samples exhibited tight, low-diversity clusters; ARDS patients showed high dispersion and numerous diffuse clusters indicative of repertoire disruption; and non-ARDS samples displayed intermediate organization. These differences suggest that immune activation states are embedded in the structural topology of the CDR3 space. Our framework successfully captured these latent patterns, offering a scalable approach to biomarker discovery. This study not only reinforces the utility of NLP in immunological analysis but also paves the way for data-driven immune monitoring in critical care and personalized diagnostics.
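To apply Word2Vec to CDR3 sequences, each sequence first has to be turned into a "sentence" of tokens. A common choice, assumed here because the abstract does not state the tokenization used, is to split each amino-acid string into overlapping k-mers. The sequences below are invented examples and the function name is hypothetical.

```python
def cdr3_to_kmers(cdr3, k=3):
    """Split one CDR3 amino-acid sequence into overlapping k-mer tokens."""
    return [cdr3[i:i + k] for i in range(len(cdr3) - k + 1)]


# Example CDR3-like strings, invented for illustration.
repertoire = ["CASSLGQGYEQYF", "CASSPGTGELFF"]
sentences = [cdr3_to_kmers(seq) for seq in repertoire]
print(sentences[0][:4])  # ['CAS', 'ASS', 'SSL', 'SLG']
```

Token lists like these are the usual input to a Word2Vec trainer; the resulting per-token vectors can then be pooled per sequence (for example by averaging) before dimensionality reduction and clustering.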
Title: Optimizing clustering of CDR3 sequences using natural language processing, Word2Vec, and KMeans.
Authors: Sanskriti Baranwal, Ricardo Avila Sanchez, Clement-Andi Edet, Erick Chastain, Inimary Toby
Pub Date: 2025-10-02; DOI: 10.3389/fbinf.2025.1623488
Journal: Frontiers in Bioinformatics, vol. 5, article 1623488.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12528129/pdf/
Pub Date: 2025-09-30; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1514880
Lara Vázquez-González, Carlos Peña-Reyes, Alba Regueira-Iglesias, Carlos Balsa-Castro, Inmaculada Tomás, María J Carreira
One area of bioinformatics that is currently attracting particular interest is the classification of polymicrobial diseases using machine learning (ML), with data obtained from high-throughput amplicon sequencing of the 16S rRNA gene in human microbiome samples. The microbial dysbiosis underlying these types of diseases is particularly challenging to classify, as the data are high-dimensional, with potentially hundreds or even thousands of predictive features. In addition, the imbalance in the composition of the microbial community is highly heterogeneous across samples. In this paper, we propose a curated pipeline for binary phenotype classification based on a count table of 16S rRNA gene amplicons, which can be applied to any microbiome. To evaluate our proposal, raw 16S rRNA gene sequences from samples of healthy and periodontally affected oral microbiomes that met certain quality criteria were downloaded from public repositories. In the end, a total of 2,581 samples were analysed. In our approach, we first reduced the dimensionality of the data using feature selection methods. After tuning and evaluating different ML models and ensembles created using Dynamic Ensemble Selection (DES) techniques, we found that all DES models performed similarly and were more robust than individual models. Although the margin over other methods was minimal, DES-P achieved the highest AUC and was therefore selected as the representative technique in our analysis. When diagnosing periodontal disease with saliva samples, it achieved, with only 13 features, an F1 score of 0.913, a precision of 0.881, a recall (sensitivity) of 0.947, an accuracy of 0.929, and an AUC of 0.973. In addition, we used EPheClass to diagnose inflammatory bowel disease (IBD) and obtained better results than other works in the literature using the same dataset. We also evaluated its effectiveness in detecting antibiotic exposure, where it again demonstrated competitive results. 
This highlights the importance and generalisability of our classification approach, which is applicable to different phenotypes, study niches, and sample types. The code is available at https://gitlab.citius.usc.es/lara.vazquez/epheclass.
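The F1 score reported for the saliva samples can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below applies the standard formula to the two values from the abstract; the function name is just for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for saliva samples.
f1 = f1_score(precision=0.881, recall=0.947)
print(round(f1, 3))  # 0.913
```

The result agrees with the reported F1 of 0.913, so the three figures are mutually consistent.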
Title: EPheClass: ensemble-based phenotype classifier from 16S rRNA gene sequences.
Journal: Frontiers in Bioinformatics, vol. 5, article 1514880.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12518240/pdf/
Pub Date: 2025-09-30; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1642039
Micaela Villacrés, Alec Avila, Karina Jimenes-Vargas, António Machado, José M Alvarez-Suarez, Eduardo Tejera
Background: Gastric cancer (GC) remains a major global health burden despite advances in diagnosis and treatment. In recent years, natural products have gained increasing attention as promising sources of anticancer agents, including agents against GC.
Methods: In this study, we applied an in silico ensemble-based modeling strategy to predict compounds with potential inhibitory effects against four GC-related cell lines: AGS, NCI-N87, BGC-823, and SNU-16. Individual predictive models were developed using several algorithms and further integrated into two consensus ensemble multi-objective models. A comprehensive database of over 100,000 natural compounds from 21,665 plant species was screened for validation and to identify potential molecular candidates.
Results: The ensemble models demonstrated a 12-15-fold improvement in identifying active molecules compared to random selection. A total of 340 molecules were prioritized, many belonging to bioactive classes such as taxane diterpenoids, flavonoids, isoflavonoids, phloroglucinols, and tryptophan alkaloids. Known anticancer compounds, including paclitaxel, orsaponin (OSW-1), glycybenzofuran, and glyurallin A, were successfully retrieved, reinforcing the validity of the approach. Species from the genera Taxus, Glycyrrhiza, Elaphoglossum, and Seseli emerged as particularly relevant sources of bioactive candidates.
Conclusion: While some genera, such as Taxus and Glycyrrhiza, have well-documented anticancer properties, others, including Elaphoglossum and Seseli, require further experimental validation. These findings highlight the potential of combining multi-objective ensemble modeling with natural product databases to discover novel phytochemicals relevant to GC treatment.
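A "fold improvement over random selection" is typically expressed as an enrichment factor: the fraction of actives among the model's selected compounds divided by the fraction of actives in the whole library. The abstract does not say exactly how its 12-15-fold figure was computed, so the sketch below shows only the standard definition, with made-up counts.

```python
def enrichment_factor(hits_selected, n_selected, hits_total, n_total):
    """Ratio of the hit rate among selected compounds to the overall hit rate."""
    if n_selected == 0 or hits_total == 0 or n_total == 0:
        raise ValueError("need non-empty selection, library, and at least one known active")
    return (hits_selected / n_selected) / (hits_total / n_total)


# Made-up counts for illustration; they are not taken from the study.
ef = enrichment_factor(hits_selected=30, n_selected=100, hits_total=250, n_total=10_000)
print(round(ef, 2))
```

With these toy numbers the selected set has a 30% hit rate against a 2.5% background rate, which is a 12-fold enrichment.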
Title: Discovering molecules and plants with potential activity against gastric cancer: an in silico ensemble-based modeling analysis.
Journal: Frontiers in Bioinformatics, vol. 5, article 1642039.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12518311/pdf/
Pub Date: 2025-09-25; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1664576
Xiangjun Wang, Panpan Jin, Juan Xu, Junyi Li, Mengzhen Ji
Background: Oral squamous cell carcinoma (OSCC) represents a significant global health challenge, with betel nut consumption being a major risk factor. 3-(methylnitrosamino)propionitrile (MNPN), a betel nut-derived nitrosamine, has been identified as a potential carcinogen, but its molecular targets in OSCC pathogenesis remain poorly understood.
Methods: We employed a comprehensive computational framework integrating target prediction, transcriptomic analysis, weighted gene co-expression network analysis (WGCNA), and machine learning approaches. Four OSCC datasets from Gene Expression Omnibus (GEO) were analyzed, and MNPN targets were predicted using ChEMBL, PharmMapper, and SwissTargetPrediction databases. Machine learning algorithms (n = 127 combinations) were evaluated for optimal biomarker identification, with model interpretability assessed using SHAP (SHapley Additive exPlanations) analysis.
Results: Target prediction identified 881 potential MNPN targets across three databases. WGCNA revealed 534 OSCC-associated differentially expressed genes, with 38 overlapping MNPN targets. Machine learning optimization identified 13 hub genes, with PLAU demonstrating the highest predictive performance (AUC = 0.944). SHAP analysis confirmed PLAU and PLOD3 as the most influential contributors to disease prediction. Functional enrichment analysis revealed MNPN targets' involvement in xenobiotic response, hypoxic conditions, and aberrant tissue remodeling.
Conclusion: This study provides the first comprehensive molecular characterization of MNPN-associated OSCC pathogenesis, identifying PLAU as a critical therapeutic target with exceptional diagnostic potential. Our findings establish a foundation for developing targeted interventions for betel nut nitrosamine-associated oral cancers and demonstrate the power of integrative computational approaches in environmental carcinogen research.
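The overlap step described in the Results (predicted MNPN targets pooled across three databases, then intersected with OSCC-associated genes) can be illustrated with a minimal sketch. The function name `overlapping_targets` and the `GENE_*` placeholders are hypothetical; only PLAU and PLOD3 are named in the abstract, and the toy sets below are not the study's data.

```python
def overlapping_targets(predicted_by_db, disease_genes):
    """Pool target predictions from every database, then keep only
    those that also appear in the disease-associated gene set."""
    all_predicted = set().union(*predicted_by_db.values())
    return all_predicted & set(disease_genes)


# Toy inputs for illustration only (GENE_* names are placeholders).
predicted = {
    "ChEMBL": {"PLAU", "GENE_A", "GENE_B"},
    "PharmMapper": {"PLOD3", "GENE_B", "GENE_C"},
    "SwissTargetPrediction": {"PLAU", "GENE_D"},
}
oscc_genes = {"PLAU", "PLOD3", "GENE_A", "GENE_E"}

shared = overlapping_targets(predicted, oscc_genes)
print(sorted(shared))  # ['GENE_A', 'PLAU', 'PLOD3']
```

Pooling with a union before intersecting means a gene predicted by any one database counts as a candidate target; requiring agreement between databases would be a stricter alternative, and the abstract does not say which rule was used.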
Integrative machine learning and transcriptomic analysis identifies key molecular targets in MNPN-associated oral squamous cell carcinoma pathogenesis. Frontiers in bioinformatics, 5:1664576. Journal Article. Published 2025-09-25. DOI: 10.3389/fbinf.2025.1664576. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12508658/pdf/
Pub Date: 2025-09-24; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1666573
Mohammed Alrouji, Mohammed S Alshammari, Sharif Alhajlah, Syed Tasqeeruddin, Khuzin Dinislam, Anas Shamsi, Saleha Anwar
Cathepsin S (CathS) is a cysteine protease known to play a role in extracellular matrix (ECM) remodelling, antigen presentation, immune cell polarisation, cancer progression, and chronic pain pathophysiology. CathS also promotes an immunosuppressive environment in solid tumors and is involved in nociceptive signaling. Although several small-molecule inhibitors with favorable in vivo properties have been developed, their clinical utility is limited by resistance, off-target effects, and suboptimal efficacy. Therefore, alternative therapeutic strategies are urgently needed. In the present study, we used an integrated virtual screening protocol to screen 3,500 commercially available FDA-approved drug molecules from DrugBank against the CathS crystal structure, followed by drug-likeness profiling and interaction analysis to filter putative candidates. Alectinib emerged as a top hit and formed significant interactions with the important active-site residues His278 and Cys139. PASS predictions suggested relevant anticancer and anti-pain activities for Alectinib relative to the control inhibitor Q1N. Subsequently, 500-ns molecular dynamics simulations with the CHARMM36 force field showed that the CathS-Alectinib complex maintained its structural stability, as indicated by conformational parameters, hydrogen-bond persistence, and essential dynamics analyses. MM-PBSA calculations further confirmed a favorable binding free energy (ΔG = -20.16 ± 2.59 kcal/mol) dominated by van der Waals and electrostatic contributions. These computational findings suggest that Alectinib may have potential as a repurposed CathS inhibitor, warranting further experimental testing in relevant cancer and chronic pain models. Notably, these results are based solely on computational analysis and require empirical validation.
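For readers unfamiliar with MM-PBSA, the reported ΔG is conventionally built from the decomposition sketched below. This is the textbook form, not a statement of the authors' exact protocol: the abstract does not say whether the entropy term was included (it is often omitted), and bonded contributions to the molecular-mechanics energy are assumed to cancel, as they do in the common single-trajectory setup.

```latex
\begin{align}
\Delta G_{\mathrm{bind}} &= G_{\mathrm{complex}} - \left(G_{\mathrm{receptor}} + G_{\mathrm{ligand}}\right) \\
&\approx \Delta E_{\mathrm{MM}} + \Delta G_{\mathrm{solv}} - T\,\Delta S, \\
\Delta E_{\mathrm{MM}} &= \Delta E_{\mathrm{vdW}} + \Delta E_{\mathrm{elec}}, \qquad
\Delta G_{\mathrm{solv}} = \Delta G_{\mathrm{polar}} + \Delta G_{\mathrm{nonpolar}}
\end{align}
```

Under this decomposition, a binding energy "dominated by van der Waals and electrostatic contributions" means that the two terms of ΔE_MM account for most of the favorable (negative) total, with the solvation terms partly offsetting them.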
Computational drug repurposing reveals Alectinib as a potential lead targeting Cathepsin S for therapeutic developments against cancer and chronic pain. Frontiers in bioinformatics, 5:1666573. Journal Article. Published 2025-09-24. DOI: 10.3389/fbinf.2025.1666573. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12504298/pdf/
Pub Date: 2025-09-22; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1645785
Baptiste Bauvin, Thibaud Godon, Guillaume Bachelot, Claudia Carpentier, Riikka Huusaari, Maxime Deraspe, Juho Rousu, Caroline Quach, Jacques Corbeil
Introduction: The complexity of COVID-19 requires approaches that extend beyond symptom-based descriptors. Multi-omic data, combining clinical, proteomic, and metabolomic information, offer a more detailed view of disease mechanisms and biomarker discovery.
Methods: As part of a large-scale Quebec initiative, we collected extensive datasets from COVID-19 positive and negative patient samples. Using a multi-view machine learning framework with ensemble methods, we integrated thousands of features across clinical, proteomic, and metabolomic domains to classify COVID-19 status. We further applied a novel feature relevance methodology to identify condensed signatures.
Results: Our models achieved a balanced accuracy of 89% ± 5% despite the high-dimensional nature of the data. Feature selection yielded 12- and 50-feature signatures that improved classification accuracy by at least 3% compared to the full feature set. These signatures were both accurate and interpretable.
Discussion: This work demonstrates that multi-omic integration, combined with advanced machine learning, enables the extraction of robust COVID-19 signatures from complex datasets. The condensed biomarker sets provide a practical path toward improved diagnosis and precision medicine, representing a significant advancement in COVID-19 biomarker discovery.
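Balanced accuracy, the metric reported in the Results, is the mean of per-class recall, so it is not inflated when one class (for example, COVID-19 positive samples) outnumbers the other. The sketch below is a minimal illustration with made-up labels; the function name and data are assumptions, not the authors' code or results.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of samples whose true label is this class.
        idx = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Made-up, imbalanced toy labels: 8 positives (1) and 2 negatives (0).
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain)                              # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

In this toy case plain accuracy is 0.9 even though half of the negatives are misclassified, while balanced accuracy (recall 1.0 for positives, 0.5 for negatives) drops to 0.75.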
Extracting a COVID-19 signature from a multi-omic dataset. Frontiers in bioinformatics, 5:1645785. Journal Article. Published 2025-09-22. DOI: 10.3389/fbinf.2025.1645785. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12497780/pdf/