Pub Date: 2025-11-25; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1694009
Suraiya Akhter, John H Miller
Bacteriocins offer a promising solution to antibiotic resistance, possessing the ability to target a wide range of bacteria with precision. Thus, there is an urgent need for a computational model to predict new bacteriocins and aid in drug development. This work centers on constructing web-based predictive models using the XGBoost machine learning algorithm, based on the physicochemical properties, structural characteristics, and sequence profiles of protein sequences. We employed correlation analyses, cross-validation, and hypergraph-based techniques to select features. Cross-validated feature selection (CVFS) partitions the dataset, selects features within each partition, and identifies the features common to all partitions, ensuring representativeness. In contrast, hypergraph-based feature evaluation (HFE) focuses on minimizing hypergraph cut conductance, leveraging higher-order data relationships to exploit information about feature and sample correlations. The XGBoost models were built using the features selected by these two feature evaluation methods. We also analyzed the feature contributions directly from the best model using SHapley Additive exPlanations (SHAP). Our HFE-based approach achieved 99.11% accuracy and an AUC of 0.9974 on the test data, overall outperforming the CVFS-based feature evaluation method and yielding results comparable to existing approaches. The most influential features are related to solvent accessibility for buried residues, followed by the composition of cysteine. Our web application, accessible at https://shiny.tricities.wsu.edu/bacteriocin-prediction/, offers prediction results, probability scores, and SHAP plots using both cross-validation- and hypergraph-based methods, along with previously implemented approaches for feature selection.
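The CVFS idea in this abstract (partition the dataset, select features within each partition, keep the features common to all partitions) can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the abstract does not say which selector is run inside each partition, so the scoring rule here (absolute difference of class means) is a stand-in, and the function names `stratified_parts`, `top_k_features`, and `cvfs`, the parameters, and the toy data are all assumptions.

```python
import random
from statistics import mean


def stratified_parts(labels, n_parts, seed=0):
    """Split sample indices into disjoint parts, spreading each class evenly."""
    rng = random.Random(seed)
    parts = [[] for _ in range(n_parts)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            parts[pos % n_parts].append(i)
    return parts


def top_k_features(rows, labels, k):
    """Stand-in selector (assumption): rank features by |mean(pos) - mean(neg)|."""
    scores = []
    for j in range(len(rows[0])):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append((abs(mean(pos) - mean(neg)), j))
    scores.sort(reverse=True)
    return {j for _, j in scores[:k]}


def cvfs(rows, labels, n_parts=3, k=2, seed=0):
    """Select features within each partition, then keep only the common ones."""
    parts = stratified_parts(labels, n_parts, seed)
    per_part = [
        top_k_features([rows[i] for i in p], [labels[i] for i in p], k)
        for p in parts
    ]
    return set.intersection(*per_part)


# Toy data (hypothetical): features 0 and 2 carry signal, 1 and 3 are noise.
rng = random.Random(1)
labels = [0, 1] * 6
rows = [
    [10 * y + rng.random(), rng.random(), 5 * y + rng.random(), rng.random()]
    for y in labels
]
common = cvfs(rows, labels)
print(sorted(common))
```

Taking the intersection across partitions is what makes the retained features "representative": a feature that ranks highly in only one partition is dropped. In practice the stand-in scorer would be replaced by whatever model-based importance the analysis actually uses.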
Title: Bacteriocin prediction through cross-validation-based and hypergraph-based feature evaluation approaches. Journal: Frontiers in bioinformatics, volume 5, article 1694009. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12685867/pdf/
Dual RNA-sequencing enables simultaneous profiling of protein-coding and non-coding transcripts from two interacting organisms, an essential capability when physical separation is difficult, such as in host-parasite or cross-kingdom interactions (e.g., plant-plant or host-pathogen systems). By allowing in silico separation of mixed reads, dual RNA-seq reveals the transcriptomic dynamics of both partners during interaction. However, existing analysis workflows often require programming expertise, limiting accessibility. We present inDAGO, a free, open-source, cross-platform graphical user interface designed for biologists without coding skills. inDAGO supports both bulk and dual RNA sequencing, with dual RNA sequencing further accommodating both sequential and combined approaches. The interface guides users through key analysis steps, including quality control, read alignment, read summarization, exploratory data analysis, and identification of differentially expressed genes, while generating intermediate outputs and publication-ready plots. Optimized for speed and efficiency, inDAGO performs complete analyses on a standard laptop (16 GB RAM) without requiring high-performance computing. We validated inDAGO using diverse real datasets to demonstrate its reliability and usability. inDAGO, available on CRAN (https://cran.r-project.org/web/packages/inDAGO/) and GitHub (https://github.com/inDAGOverse/inDAGO), lowers the technical barrier to dual RNA-seq by enabling robust, reproducible analyses, even for users without coding experience.
Title: inDAGO: a user-friendly interface for seamless dual and bulk RNA-Seq analysis. Authors: Gaetano Aufiero, Carmine Fruggiero, Nunzio D'Agostino. Journal: Frontiers in bioinformatics, volume 5, article 1696823. Pub Date: 2025-11-21; DOI: 10.3389/fbinf.2025.1696823. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12678335/pdf/
Background: Gut fungi play crucial roles in human health. The profiling of the human gut mycobiome continues to progress. However, the choice of ribosomal DNA marker regions can substantially affect the taxonomic resolution achieved for a population. In particular, the impact of using primer combinations is insufficiently defined. In this study, we investigated the performance of three targeted sequencing regions, ITS1, ITS2 and 18S rRNA, separately and in combination.
Methods: Eight fecal samples from healthy individuals (n = 4) and cancer patients (n = 4) were selected as proof of principle for amplicon-based sequencing conducted with the DNBSEQ™ sequencing system. Quality-filtered reads were grouped into operational taxonomic units (OTUs) via USEARCH and categorized using the SILVA (18S) and UNITE (ITS) databases. Downstream bioinformatics encompassed diversity analyses, principal component analysis (PCA), and biomarker detection via linear discriminant analysis effect size (LEfSe). To improve taxonomic coverage and compositional understanding, data were examined using ALDEx2 with centered log-ratio (CLR) transformation, facilitating reliable differential abundance and effect size assessment in small sample metagenomic contexts.
Results and discussion: Among primers, ITS2 and ITS1 enhanced the coverage of identified taxa, yielding 183 and 158 operational taxonomic units (OTUs), respectively, compared with 58 OTUs for 18S. Accordingly, among the primer combinations tested, the triple integration of ITS1-ITS2-18S produced the highest fungal richness, while the dual ITS1-ITS2 combined datasets enhanced group discrimination analysis, showing enrichment of Candida albicans and scarcity of Penicillium sp. in cancer patients. Our findings based on ITS sequencing and the combination of ITS1 and ITS2 provide instructive information on the composition and dynamics of gut fungi in our initial test subjects, enhancing our understanding of their roles in gut homeostasis and the microbial shifts associated with cancer. Although our approach was conducted with a limited cohort to establish methodological feasibility, it brings attention to multi-marker strategies, demonstrating that integrated primer datasets surpass traditional single-marker methods in both taxonomic coverage and biomarker detection sensitivity in low-biomass fecal samples. Our research provides a reliable starting point for future studies on the gut mycobiome in both healthy and diseased individuals, which could lead to better diagnostics and treatments based on microbiome profiles.
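The centered log-ratio (CLR) transformation mentioned in the Methods can be illustrated with a short Python sketch. For a sample with counts x_1..x_D, CLR maps each count to log(x_i) minus the mean of the logs, so values are expressed relative to the sample's geometric mean. The sketch below is not ALDEx2 itself: ALDEx2 handles zeros through Monte Carlo sampling from a Dirichlet distribution, whereas this example simply adds a fixed pseudocount, which is an assumption made for brevity. The function name `clr` and the example counts are hypothetical.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log of each count minus the mean log of the sample.

    The pseudocount (an assumption, not ALDEx2's approach) avoids log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    center = sum(logs) / len(logs)
    return [v - center for v in logs]


# Hypothetical OTU counts for one sample.
sample = [120, 30, 0, 7]
values = clr(sample)
print([round(v, 3) for v in values])
```

CLR values within a sample sum to zero, which is why they are read as "above or below the sample's typical abundance" rather than as absolute quantities.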
Title: Multi-marker comparative analysis of 18S, ITS1, and ITS2 primers for human gut mycobiome profiling. Authors: Hiba Orsud, Sumaya Zoughbor, Fatima AlDhaheri, Khalid Hajissa, Manar Refaey, Suad Ajab, Khaled Alswaider, Nora Mohamed, Obaid Alkaabi, Zakeya Al Rasbi. Journal: Frontiers in bioinformatics, volume 5, article 1690766. Pub Date: 2025-11-19; DOI: 10.3389/fbinf.2025.1690766. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12672528/pdf/
Pub Date: 2025-11-19; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1690229
Konstantinos Perperidis, Themis P Exarchos, Aristidis G Vrahatis, Panagiotis Vlamos, Marios G Krokidis
Parkinson's disease (PD) is the most common neurodegenerative movement disorder. Its pathophysiology is defined by a loss of dopaminergic neurons in the substantia nigra pars compacta; however, recent studies suggest that the peripheral immune system may participate in PD development. Herein, we sought molecular insights by analyzing RNA-seq data obtained from the peripheral blood of both Parkinson's disease patients and healthy controls. Although all age and gender groups were analyzed, emphasis is placed on individuals aged 50-70, the most prevalent group for Parkinson's diagnosis. The computational workflow comprises both bioinformatics analyses and machine learning processes, and the pipeline's output includes transcripts ranked by their level of significance, which could serve as reliable genetic signatures. Classification outcomes are also examined with a focus on the significance of selected features, ultimately facilitating the development of gene networks implicated in the disease. The thorough functional analysis of the most prominent genes, regarding their biological relevance to PD, indicates that the proposed framework has strong potential for identifying blood-based biomarkers of the disease. Moreover, this approach facilitates the application of machine learning techniques to RNA-seq data from complex disorders, enabling deeper insights into critical biological processes at the molecular level.
Title: Computational analysis of transcriptome data and mapping of functional networks in Parkinson's disease. Journal: Frontiers in bioinformatics, volume 5, article 1690229. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12672545/pdf/
Pub Date: 2025-11-19; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1710926
Jack M Craig, Whitney L Fisher, Allan S Thomas, S Blair Hedges, Sudhir Kumar
Afrotheria, the superorder that includes aardvarks, elephants, elephant shrews, hyraxes, manatees, and tenrecs, is home to some of the most charismatic and well-studied animals on Earth. Here, we assemble a nearly taxonomically complete molecular timetree of Afrotheria using an integrative approach that combines a literature search for published timetrees, de novo dating of untimed molecular phylogenies, and inference of timetrees from new alignments. The resulting timetree sheds light on the role of the Cretaceous-Paleogene (K-Pg) event ∼66 million years ago in the diversification of Afrotherian orders. The earliest divergence in the timetree of Afrotherian mammals predates the K-Pg event by 12 million years, followed by five interordinal divergences that occurred gradually over a 16-million-year period encompassing the K-Pg event.
Title: Completing a molecular timetree of Afrotheria. Journal: Frontiers in bioinformatics, volume 5, article 1710926. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12672906/pdf/
Pub Date: 2025-11-18; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1716375
Manisha Shah, Sivakumar Arumugam
Introduction: Tumor necrosis factor-alpha (TNF-alpha) is a central mediator of chronic inflammation and a validated therapeutic target in atherosclerosis and related cardiovascular disorders. Peptide therapeutics offer high specificity and low toxicity; however, few natural sequences have been optimized for durable TNF-alpha inhibition.
Methods: A dual in silico strategy was employed to identify potent inhibitors: (i) virtual screening of experimentally validated food-derived bioactive peptides and (ii) rational design of an N-C cyclized, disulfide-bridged peptide based on the TNF-alpha-TNFR1 interface. Molecular docking, 200-ns molecular dynamics simulations, and MM/PBSA free-energy analyses were performed.
Results: The selected peptides exhibited strong and persistent interactions with key TNF-alpha residues, particularly Tyr119. The cyclic analogue demonstrated deeper free-energy minima, higher binding affinity, and more stable hydrogen-bond networks than the linear sequence. ADMET profiling revealed superior metabolic stability, reduced plasma clearance, and no predicted cardiotoxicity.
Discussion: These results indicate that dietary peptides can serve as templates for TNF-alpha inhibition, and interface-guided cyclization rationally enhances stability, binding affinity, and drug-like properties. This study provides a mechanistic framework for developing food-derived peptides as next-generation TNF-alpha antagonists and supports United Nations SDGs 3 and 9 by promoting innovative, low-toxicity therapeutics for chronic inflammation and cardiovascular diseases.
Title: Food-derived linear vs. rationally designed cyclic peptides as potent TNF-alpha inhibitors: an integrative computational study. Journal: Frontiers in bioinformatics, volume 5, article 1716375. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12669231/pdf/
Introduction: Histone-lysine N-methyltransferase 2D (KMT2D) is an H3K4 methyltransferase and a potential tumor suppressor with a crucial role in regulating gene expression. Its dysregulation has been implicated in developmental disorders and several types of cancers. Despite this, the molecular mechanisms that govern its activity remain largely elusive. Among these mechanisms, post-translational modifications, especially phosphorylation, serve as essential regulators, fine-tuning KMT2D stability, localization and functional interactions to maintain cellular homeostasis. With over 173 phosphorylation sites reported, KMT2D is significantly regulated by kinases, and exploring its phospho-regulatory network with targeted in vitro approaches is challenging.
Methods: We systematically curated and integrated the global phosphoproteomic datasets, along with their corresponding experimental conditions, to comprehensively identify the phosphorylation events reported for KMT2D. The site exhibiting the highest frequency of detection across these datasets is considered the predominant phosphorylation site. To investigate its functional significance, we analyzed the proteins and their phosphorylation sites that are differentially co-regulated with the predominant site, as well as its associated upstream kinases and interacting proteins.
Results: Among the 173 reported phosphorylation sites representing KMT2D, Serine 2274 (S2274) emerged as the predominant site, detected in over 42% of diverse mass spectrometry-based phosphoproteomics datasets. This site lies within one of KMT2D's unique "LSPPP" motifs, suggesting a potential regulatory role. Detailed investigation of the differentially co-regulated protein phosphosites revealed that phosphorylation of KMT2D at S2274 is consistently and positively co-regulated with MAPK1/ERK2 activation, as well as with proteins involved in the MAPK cascade, epigenetic regulation and cell differentiation. Notably, ERK2 was predicted as an upstream kinase targeting S2274, suggesting that KMT2D S2274 functions as a potential downstream effector of the MEK-ERK signaling pathway, potentially linking it to epigenetic regulation and cell differentiation. Further, our results highlighted a potential mechanistic link between disrupted phosphorylation at S2274 and the pathogenesis of Kabuki syndrome.
Discussion: This study delineates the phosphoregulatory network of KMT2D, positioning it as a dynamic epigenetic effector modulated by MEK-ERK signaling, with broader implications for cancer and developmental disorders.
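The rule used in the Methods to pick the predominant phosphorylation site (the site detected in the largest share of datasets) amounts to a frequency count. The Python sketch below illustrates that rule on toy data only: the dataset identifiers, the placeholder site names `site_A` and `site_B`, the function name `predominant_site`, and the resulting fraction are hypothetical and do not reproduce the study's data or its reported value of over 42%.

```python
from collections import Counter


def predominant_site(datasets):
    """Return the most frequently detected site and its detection fraction.

    `datasets` maps a dataset identifier to the sites detected in it; each
    site is counted at most once per dataset.
    """
    counts = Counter(site for sites in datasets.values() for site in set(sites))
    site, n_detected = counts.most_common(1)[0]
    return site, n_detected / len(datasets)


# Toy detections (hypothetical), keyed by a made-up dataset identifier.
toy_datasets = {
    "ds1": ["S2274", "site_A"],
    "ds2": ["S2274"],
    "ds3": ["site_B"],
    "ds4": ["S2274", "site_B"],
    "ds5": ["site_A"],
}
site, fraction = predominant_site(toy_datasets)
print(site, fraction)
```

Counting each site once per dataset, rather than once per spectrum or peptide, keeps a single deep dataset from dominating the ranking.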
Title: Role of histone-lysine N-methyltransferase 2D (KMT2D) in MEK-ERK signaling-mediated epigenetic regulation: a phosphoproteomics perspective. Authors: Sreeshma Ravindran Kammarambath, Leona Dcunha, Athira Perunelly Gopalakrishnan, Amal Fahma, Neelam Krishna, Altaf Mahin, Samseera Ummar, Prathik Basthikoppa Shivamurthy, Inamul Hasan Madar, Rajesh Raju. Journal: Frontiers in bioinformatics, volume 5, article 1683469. Pub Date: 2025-11-18; DOI: 10.3389/fbinf.2025.1683469. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12669113/pdf/
Pub Date: 2025-11-17; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1687617
Anthony Wong, Sanskruthi Guduri, TsungYen Chen, Kunal Patel
Introduction: Multi-target peptide therapeutics targeting glucagon receptor (GCGR), glucagon-like peptide-1 receptor (GLP1R), and glucose-dependent insulinotropic polypeptide receptor (GIPR) represent a promising approach for treating diabetes and obesity. Triple agonist peptides demonstrate promising therapeutic potential compared to single-target approaches, yet rational design remains computationally challenging due to complex sequence-structure-activity relationships. Existing methods, primarily based on convolutional neural networks, impose limitations including fixed sequence lengths and inadequate representation of molecular topology. Graph Attention Networks (GAT) offer advantages in capturing molecular structures and variable-length peptide sequences while providing interpretable insights into receptor-specific binding determinants.
Methods: A dataset of 234 peptide sequences with experimentally determined binding affinities was compiled from multiple sources. Peptides were represented as molecular graphs with seven-dimensional node features encoding physicochemical properties and positional information. The GAT architecture employed a shared encoder with task-specific prediction heads, implementing transfer learning to address limited GIPR training data. Performance was evaluated using 5-fold cross-validation and independent validation on 24 literature-derived sequences. A genetic algorithm framework was developed for peptide sequence optimization, incorporating multi-objective fitness evaluation based on predicted binding affinity, biological plausibility, and sequence novelty.
Results: Cross-validation demonstrated robust GAT performance across all receptors, with GCGR achieving high accuracy (AUC-ROC: 0.915 ± 0.050), followed by GLP1R (AUC-ROC: 0.853 ± 0.059), and GIPR showing acceptable performance despite limited data (AUC-ROC: 0.907 ± 0.083). Comparative analysis revealed receptor-specific advantages: GAT significantly outperformed CNN for GCGR prediction (RMSE: 0.942 vs. 1.209, p = 0.0013), while CNN maintained superior GLP1R performance (RMSE: 0.552 vs. 0.723). Genetic algorithm optimization yielded a measurable improvement over baseline, with a 4.0% fitness enhancement and the generation of 20 candidates exhibiting mean binding probabilities exceeding 0.5 across all targets. The GAT-based framework provides a computational approach to peptide design, demonstrating receptor-specific advantages and robust optimization capabilities.
Conclusion: Genetic algorithm optimization enables systematic exploration of sequence space within existing agonist scaffolds while maintaining biological constraints. This approach provides a rational framework for prioritizing experimental validation efforts in triple agonist development.
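The genetic-algorithm step described in the Methods can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the fitness weights, the toy `predicted_affinity` scorer (a stand-in for the trained GAT prediction heads), the mutation-only variation operator, and the seed peptide used in the usage note are all hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def predicted_affinity(seq):
    """Hypothetical stand-in for a model's predicted binding probability.

    Here it is just the fraction of hydrophobic residues, purely so the
    sketch runs without a trained model.
    """
    hydrophobic = set("AILMFVWY")
    return sum(r in hydrophobic for r in seq) / len(seq)


def plausibility(seq):
    """Toy plausibility term: penalise runs of three identical residues."""
    runs = sum(1 for i in range(len(seq) - 2) if seq[i] == seq[i + 1] == seq[i + 2])
    return 1.0 / (1.0 + runs)


def novelty(seq, scaffold):
    """Fraction of positions that differ from the starting scaffold."""
    return sum(a != b for a, b in zip(seq, scaffold)) / len(scaffold)


def fitness(seq, scaffold, weights=(0.6, 0.3, 0.1)):
    """Weighted sum of the three objectives; the weights are assumptions."""
    w_aff, w_plaus, w_nov = weights
    return (w_aff * predicted_affinity(seq)
            + w_plaus * plausibility(seq)
            + w_nov * novelty(seq, scaffold))


def mutate(seq, rng, rate=0.1):
    """Replace each residue with a random amino acid with probability `rate`."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else r for r in seq)


def evolve(scaffold, generations=30, pop_size=40, seed=0):
    """Elitist GA: keep the top quarter, refill the rest with mutated elites."""
    rng = random.Random(seed)
    population = [scaffold] + [mutate(scaffold, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=lambda s: fitness(s, scaffold), reverse=True)
        elite = population[: pop_size // 4]
        children = [mutate(rng.choice(elite), rng) for _ in range(pop_size - len(elite))]
        population = elite + children
    return max(population, key=lambda s: fitness(s, scaffold))
```

Calling `evolve("HSQGTFTSDYSKYLD")` on this arbitrary seed string returns a same-length sequence whose fitness is at least that of the seed, because the elite is carried over unchanged each generation. Collapsing the objectives into one weighted sum is the simplest choice; a Pareto-based selection scheme would be an alternative when the objectives conflict strongly.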
{"title":"Machine learning-guided optimization of triple agonist peptide therapeutics for metabolic disease.","authors":"Anthony Wong, Sanskruthi Guduri, TsungYen Chen, Kunal Patel","doi":"10.3389/fbinf.2025.1687617","DOIUrl":"10.3389/fbinf.2025.1687617","url":null,"abstract":"<p><strong>Introduction: </strong>Multi-target peptide therapeutics targeting glucagon receptor (GCGR), glucagon-like peptide-1 receptor (GLP1R), and glucose-dependent insulinotropic polypeptide receptor (GIPR) represent a promising approach for treating diabetes and obesity. Triple agonist peptides demonstrate promising therapeutic potential compared to single-target approaches, yet rational design remains computationally challenging due to complex sequence-structure activity relationships. Existing methods, primarily based on convolutional neural networks, impose limitations including fixed sequence lengths and inadequate representation of molecular topology. Graph Attention Networks (GAT) offer advantages in capturing molecular structures and variable-length peptide sequences while providing interpretable insights into receptor-specific binding determinants.</p><p><strong>Methods: </strong>A dataset of 234 peptide sequences with experimentally determined binding affinities was compiled from multiple sources. Peptides were represented as molecular graphs with seven-dimensional node features encoding physicochemical properties and positional information. The GAT architecture employed a shared encoder with task-specific prediction heads, implementing transfer learning to address limited GIPR training data. Performance was evaluated using 5-fold cross-validation and independent validation on 24 literature-derived sequences. 
A genetic algorithm framework was developed for peptide sequence optimization, incorporating multi-objective fitness evaluation based on predicted binding affinity, biological plausibility, and sequence novelty.</p><p><strong>Results: </strong>Cross-validation demonstrated robust GAT performance across all receptors, with GCGR achieving high accuracy (AUC-ROC: 0.915 ± 0.050), followed by GLP1R (AUC-ROC: 0.853 ± 0.059), and GIPR showing acceptable performance despite limited data (AUC-ROC: 0.907 ± 0.083). Comparative analysis revealed receptor-specific advantages: GAT significantly outperformed CNN for GCGR prediction (RMSE: 0.942 vs. 1.209, p = 0.0013), while CNN maintained superior GLP1R performance (RMSE: 0.552 vs. 0.723). Genetic algorithm optimization yielded a measurable improvement over baseline, with a 4.0% fitness enhancement and the generation of 20 candidates exhibiting mean binding probabilities exceeding 0.5 across all targets. The GAT-based framework provides a computational approach to peptide design, demonstrating receptor-specific advantages and robust optimization capabilities.</p><p><strong>Conclusion: </strong>Genetic algorithm optimization enables systematic exploration of sequence space within existing agonist scaffolds while maintaining biological constraints. 
This approach provides a rational framework for prioritizing experimental validation efforts in triple agonist development.</p>","PeriodicalId":73066,"journal":{"name":"Frontiers in bioinformatics","volume":"5 ","pages":"1687617"},"PeriodicalIF":3.9,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12665757/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145662025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-17; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1663846
Osasan Stephen Adebayo, George Oche Ambrose, Daramola Olusola, Adefolalu Oluwafemi, Hind A Alzahrani, Abdulkarim Hasan
Introduction: KRAS mutations are key oncogenic drivers in lung cancer, yet effective pharmacological targeting has remained a major challenge due to the protein's elusive and dynamic binding pockets. Computational modeling offers a promising route to identify novel inhibitors with improved potency and selectivity.
Methods: A quantitative structure-activity relationship (QSAR) modeling approach was developed to predict the inhibitory potency (pIC50) of KRAS inhibitors and support de novo drug design. Molecular descriptors for 62 inhibitors retrieved from the ChEMBL database (CHEMBL4354832) were computed using Chemopy. Following descriptor normalization and dimensionality reduction, five machine learning algorithms were applied: partial least squares (PLS), random forest (RF), stepwise multiple linear regression (MLR), genetic-algorithm-optimized MLR (GA-MLR), and XGBoost. Model performance was evaluated using R², RMSE, and MAE, while permutation-based importance and SHAP analyses provided feature interpretability.
Results: Among the models tested, PLS exhibited the best predictive performance (R² = 0.851; RMSE = 0.292), followed by RF (R² = 0.796). The GA-MLR model, based on eight optimized molecular descriptors, achieved good interpretability and robust internal validation (R² = 0.677). Virtual screening of 56 de novo designed compounds within the model's applicability domain identified compound C9 (predicted pIC50 of 8.11) as the most promising hit.
Discussion: This integrative QSAR modeling and de novo design framework effectively predicted the bioactivity of KRAS inhibitors and facilitated the identification of novel candidate molecules. The findings demonstrate the utility of combining interpretable machine learning models with virtual screening to accelerate the discovery of potent KRAS inhibitors for lung cancer therapy.
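As a reading aid for the quantities reported above, the following sketch shows the standard definitions of R², RMSE, and MAE, together with the conversion from pIC50 to IC50. The function names and the toy values in the usage note are illustrative assumptions and are not taken from the study; the only convention relied on is the usual one, pIC50 = -log10(IC50 in mol/L).

```python
import math


def pic50_to_ic50_nM(pic50):
    """Convert pIC50 (= -log10 of IC50 in mol/L) to IC50 in nanomolar."""
    return 10 ** (-pic50) * 1e9


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Under this convention, a predicted pIC50 of 8.11 corresponds to an IC50 of roughly 7.8 nM. For toy values `y_true = [6.0, 7.0, 8.0]` and `y_pred = [6.5, 7.0, 7.5]`, `r2_score` gives 0.75, `rmse` about 0.408, and `mae` about 0.333.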
{"title":"QSAR-guided discovery of novel KRAS inhibitors for lung cancer therapy.","authors":"Osasan Stephen Adebayo, George Oche Ambrose, Daramola Olusola, Adefolalu Oluwafemi, Hind A Alzahrani, Abdulkarim Hasan","doi":"10.3389/fbinf.2025.1663846","DOIUrl":"10.3389/fbinf.2025.1663846","url":null,"abstract":"<p><strong>Introduction: </strong>KRAS mutations are key oncogenic drivers in lung cancer, yet effective pharmacological targeting has remained a major challenge due to the protein's elusive and dynamic binding pockets. Computational modeling offers a promising route to identify novel inhibitors with improved potency and selectivity.</p><p><strong>Methods: </strong>A quantitative structure-activity relationship (QSAR) modeling approach was developed to predict the inhibitory potency (pIC<sub>50</sub>) of KRAS inhibitors and support <i>de novo</i> drug design. Molecular descriptors for 62 inhibitors retrieved from the ChEMBL database (CHEMBL4354832) were computed using Chemopy. Following descriptor normalization and dimensionality reduction, five machine learning algorithms were applied: partial least squares (PLS), random forest (RF), stepwise multiple linear regression (MLR), genetic-algorithm-optimized MLR (GA-MLR), and XGBoost. Model performance was evaluated using <i>R</i> <sup>2</sup>, RMSE, and MAE, while permutation-based importance and SHAP analyses provided feature interpretability.</p><p><strong>Results: </strong>Among the models tested, PLS exhibited the best predictive performance (<i>R</i> <sup>2</sup> = 0.851; RMSE = 0.292), followed by RF (<i>R</i> <sup>2</sup> = 0.796). The GA-MLR model, based on eight optimized molecular descriptors, achieved good interpretability and robust internal validation (<i>R</i> <sup>2</sup> = 0.677). 
Virtual screening of 56 <i>de novo</i> designed compounds within the model's applicability domain identified compound C9 (predicted pIC<sub>50</sub> of 8.11) as the most promising hit.</p><p><strong>Discussion: </strong>This integrative QSAR modeling and <i>de novo</i> design framework effectively predicted the bioactivity of KRAS inhibitors and facilitated the identification of novel candidate molecules. The findings demonstrate the utility of combining interpretable machine learning models with virtual screening to accelerate the discovery of potent KRAS inhibitors for lung cancer therapy.</p>","PeriodicalId":73066,"journal":{"name":"Frontiers in bioinformatics","volume":"5 ","pages":"1663846"},"PeriodicalIF":3.9,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12665777/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145662629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-13; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1636240
Aanya Gupta, Koji Abe, Holden T Maecker
Introduction: FluPRINT is a multi-omics dataset that measures donors' protein expression and cell counts across various assays. Donors were also assigned a binary label: high responders (1) if the hemagglutination inhibition (HAI) antibody titer showed a fold change ≥4 from day 0 to day 28, and low responders (0) otherwise. In this project, we used the MOFA and Stabl algorithms to analyze FluPRINT, estimate the population structure from the data, and identify the most important features for predicting response to the vaccine.
Methods: Preprocessing of the dataset included removing repeated features, scaling by assay, and removing outliers. Because Stabl does not directly handle missing values, features with a high proportion of missing values were removed, and the remaining missing values were ignored.
Results: MOFA identified the top feature in structure extraction as IL neg 2 CD4 pos CD45Ra neg pSTAT5. MOFA explains the variance of the data well while also selecting statistically significant features (p < 0.05). Stabl found the top feature for explaining the outcome to be CD33- CD3+ CD4+ CD25hiCD127low CD161+ CD45RA+ Tregs, which matched the top result of a previously published analysis. MOFA's features achieved an AUROC of 0.616 (95% CI: 0.426-0.806), and Stabl's achieved an AUROC of 0.634 (95% CI: 0.432-0.823).
Discussion: Our research addresses a key knowledge gap: understanding how these fundamentally different analytical approaches perform when analyzing the same complex dataset. Our exploration evaluates their respective strengths, limitations, and biological insights and provides guidance on using MOFA and Stabl to find the best predictive cell subsets and features for understanding large immunological multi-omics data. The code for this project can be found at https://github.com/aanya21gupta/fluprint.
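The responder labeling rule and the AUROC metric used in this abstract can be made concrete with a short sketch. It is illustrative only and is not taken from the project's repository: the function names are hypothetical, a single day-0 and day-28 titer per donor is assumed, and AUROC is computed from its pairwise-ranking definition rather than with any particular library.

```python
def label_responder(titer_day0, titer_day28, threshold=4.0):
    """Return 1 (high responder) if the HAI titer fold change is >= threshold, else 0."""
    return 1 if titer_day28 / titer_day0 >= threshold else 0


def auroc(labels, scores):
    """Area under the ROC curve via its ranking interpretation.

    Equals the probability that a randomly chosen positive is scored above a
    randomly chosen negative, with ties counted as one half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, a donor whose titer rises from 10 to 40 is labeled 1, while a rise from 10 to 20 is labeled 0. With labels `[1, 1, 0, 0]` and scores `[0.9, 0.4, 0.5, 0.1]`, three of the four positive-negative pairs are ranked correctly, so `auroc` returns 0.75. An AUROC near 0.5 indicates ranking no better than chance, which is the context in which the wide confidence intervals reported above should be read.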
{"title":"Comprehensive analysis of multi-omics vaccine response data using MOFA and Stabl algorithms.","authors":"Aanya Gupta, Koji Abe, Holden T Maecker","doi":"10.3389/fbinf.2025.1636240","DOIUrl":"10.3389/fbinf.2025.1636240","url":null,"abstract":"<p><strong>Introduction: </strong>FluPRINT is a multi-omics dataset that measures donors' protein expression and cell counts across various assays. Donors were also assigned a binary value (0 or 1), being labeled as high responders if they had a fold change ≥4 of the antibody titer for hemagglutination inhibition (HAI) from day 0 to day 28, and low responders otherwise (0). In this project, we used the MOFA and Stabl algorithms to analyze FluPRINT, estimate the population structure from the data, and identify the most important features for predicting response to the vaccine.</p><p><strong>Methods: </strong>The preprocessing of the dataset included removing repeat features, scaling by assay, and removing outliers. Since Stabl does not directly address missing values, features with high amounts of missing values were removed and the remaining were ignored.</p><p><strong>Results: </strong>MOFA identified the top feature in structure extraction as IL neg 2 CD4 pos CD45Ra neg pSTAT5. MOFA explains well the variance of the data while also choosing features that have good significance, as illustrated by their significant p-values (p < 0.05). Stabl found the top feature for explaining the outcome to be CD33<sup>-</sup> CD3<sup>+</sup> CD4<sup>+</sup> CD25hiCD127low CD161+ CD45RA + Tregs, which matched the top result of previously published analysis. MOFA's features achieved an AUROC of 0.616 (95% CI of 0.426-0.806), and Stabl's achieved an AUROC of 0.634 (95% CI of 0.432-0.823).</p><p><strong>Discussion: </strong>Our research addresses a key knowledge gap: understanding how these fundamentally different analytical approaches perform when analyzing the same complex dataset. 
Our exploration evaluates their respective strengths, limitations, and biological insights and provides guidance on using MOFA and Stabl to find the best predictive cell subsets and features for understanding large immunological multi-omics data. The code for this project can be found at https://github.com/aanya21gupta/fluprint.</p>","PeriodicalId":73066,"journal":{"name":"Frontiers in bioinformatics","volume":"5 ","pages":"1636240"},"PeriodicalIF":3.9,"publicationDate":"2025-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657425/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145649743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}