Frontiers in bioinformatics: latest articles, page 5

Bacteriocin prediction through cross-validation-based and hypergraph-based feature evaluation approaches.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2025-11-25 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1694009

Suraiya Akhter, John H Miller

Bacteriocins offer a promising solution to antibiotic resistance, possessing the ability to target a wide range of bacteria with precision. Thus, there is an urgent need for a computational model to predict new bacteriocins and aid in drug development. This work centers on constructing web-based predictive models using the XGBoost machine learning algorithm, based on the physicochemical properties, structural characteristics, and sequence profiles of protein sequences. We employed correlation analyses, cross-validation, and hypergraph-based techniques to select features. Cross-validated feature selection (CVFS) partitions the dataset, selects features within each partition, and identifies the features common to all partitions, ensuring representativeness. In contrast, hypergraph-based feature evaluation (HFE) focuses on minimizing hypergraph cut conductance, leveraging higher-order data relationships to make precise use of information about feature and sample correlations. The XGBoost models were built using the features selected by these two feature evaluation methods. We also analyzed the feature contributions directly from the best model using SHapley Additive exPlanations (SHAP). Our HFE-based approach achieved 99.11% accuracy and an AUC of 0.9974 on the test data, overall outperforming the CVFS-based feature evaluation method and yielding results comparable to existing approaches. The most influential features are related to solvent accessibility for buried residues, followed by the composition of cysteine. Our web application, accessible at https://shiny.tricities.wsu.edu/bacteriocin-prediction/, offers prediction results, probability scores, and SHAP plots using both cross-validation- and hypergraph-based methods, along with previously implemented approaches for feature selection.
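The CVFS idea described in the abstract (partition the samples, select features within each partition, keep the features common to all partitions) can be illustrated with a minimal sketch. The abstract does not say how features are scored inside a partition, so ranking by absolute Pearson correlation with the label, the interleaved partitioning, and the names `pearson`, `cvfs_common_features`, `n_partitions`, and `top_m` are all assumptions made for illustration, not the authors' implementation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def cvfs_common_features(X, y, n_partitions=3, top_m=2):
    """Select top_m features per disjoint partition, then intersect the selections.

    Assumption: features are ranked by |Pearson r| with the label; the
    published method may use a different per-partition selector.
    """
    indices = list(range(len(X)))
    # Interleaved (strided) split into disjoint partitions; a stratified
    # random split would be the more realistic choice.
    partitions = [indices[i::n_partitions] for i in range(n_partitions)]
    n_features = len(X[0])
    selected_sets = []
    for part in partitions:
        labels = [y[i] for i in part]
        scores = []
        for j in range(n_features):
            column = [X[i][j] for i in part]
            scores.append((abs(pearson(column, labels)), j))
        scores.sort(reverse=True)
        selected_sets.append({j for _, j in scores[:top_m]})
    # Features chosen in every partition are treated as representative.
    return sorted(set.intersection(*selected_sets))


# Toy data (hypothetical): feature 0 equals the label, feature 1 is its
# complement, feature 2 is unrelated, feature 3 is constant.
y = [i % 2 for i in range(12)]
X = [[label, 1 - label, (i * 7) % 5, 1.0] for i, label in enumerate(y)]
common = cvfs_common_features(X, y)
print(common)  # → [0, 1]
```

Intersecting the per-partition selections is what makes the result conservative: a feature that looks informative in only one partition is dropped.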

Citations: 0

inDAGO: a user-friendly interface for seamless dual and bulk RNA-Seq analysis.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2025-11-21 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1696823

Gaetano Aufiero, Carmine Fruggiero, Nunzio D'Agostino

Dual RNA-sequencing enables simultaneous profiling of protein-coding and non-coding transcripts from two interacting organisms, an essential capability when physical separation is difficult, such as in host-parasite or cross-kingdom interactions (e.g., plant-plant or host-pathogen systems). By allowing in silico separation of mixed reads, dual RNA-seq reveals the transcriptomic dynamics of both partners during interaction. However, existing analysis workflows often require programming expertise, limiting accessibility. We present inDAGO, a free, open-source, cross-platform graphical user interface designed for biologists without coding skills. inDAGO supports both bulk and dual RNA sequencing, with dual RNA sequencing further accommodating both sequential and combined approaches. The interface guides users through key analysis steps, including quality control, read alignment, read summarization, exploratory data analysis, and identification of differentially expressed genes, while generating intermediate outputs and publication-ready plots. Optimized for speed and efficiency, inDAGO performs complete analyses on a standard laptop (16 GB RAM) without requiring high-performance computing. We validated inDAGO using diverse real datasets to demonstrate its reliability and usability. inDAGO, available on CRAN (https://cran.r-project.org/web/packages/inDAGO/) and GitHub (https://github.com/inDAGOverse/inDAGO), lowers the technical barrier to dual RNA-seq by enabling robust, reproducible analyses, even for users without coding experience.

Citations: 0

Multi-marker comparative analysis of 18S, ITS1, and ITS2 primers for human gut mycobiome profiling.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2025-11-19 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1690766

Hiba Orsud, Sumaya Zoughbor, Fatima AlDhaheri, Khalid Hajissa, Manar Refaey, Suad Ajab, Khaled Alswaider, Nora Mohamed, Obaid Alkaabi, Zakeya Al Rasbi

Background: Gut fungi play crucial roles in human health. The profiling of the human gut mycobiome continues to progress. However, the choice of ribosomal DNA marker region can substantially affect the taxonomic resolution obtained for a fungal community. In particular, the impact of using primer combinations is insufficiently defined. In this study, we investigated the performance of three targeted sequencing regions, ITS1, ITS2, and 18S rRNA, separately and in combination.

Methods: Eight fecal samples from healthy individuals (n = 4) and cancer patients (n = 4) were selected as proof of principle for amplicon-based sequencing conducted with the DNBSEQ™ sequencing system. Quality-filtered reads were grouped into operational taxonomic units (OTUs) via USEARCH and categorized using the SILVA (18S) and UNITE (ITS) databases. Downstream bioinformatics encompassed diversity analyses, principal component analysis (PCA), and biomarker detection via linear discriminant analysis effect size (LEfSe). To improve taxonomic coverage and compositional understanding, data were examined using ALDEx2 with centered log-ratio (CLR) transformation, facilitating reliable differential abundance and effect size assessment in small sample metagenomic contexts.
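The centered log-ratio (CLR) transformation mentioned in the Methods can be shown on a single sample as a small sketch. ALDEx2 is an R package that, as documented, draws Monte Carlo Dirichlet instances before applying CLR; the version below replaces that step with a simple pseudocount, which is an assumption made only to keep the example short. The function name `clr`, the `pseudocount` parameter, and the counts are hypothetical.

```python
from math import log


def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log of each part minus the mean log over all parts.

    Assumption: zeros are handled with a fixed pseudocount rather than the
    Dirichlet sampling that ALDEx2 performs.
    """
    logs = [log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical OTU counts for one sample.
counts = [10, 0, 5, 85]
values = clr(counts)
print([round(v, 3) for v in values])
```

CLR values sum to zero within a sample, so they express each taxon relative to the sample's geometric mean rather than as an absolute abundance, which is why the transformation suits compositional sequencing data.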

Results and discussion: Among primers, ITS2 and ITS1 enhanced the coverage of identified taxa, yielding 183 and 158 OTUs, respectively, compared to 58 OTUs for 18S. Accordingly, among the primer combinations tested, the triple integration of ITS1-ITS2-18S produced the highest fungal richness, while the dual ITS1-ITS2 combined datasets enhanced group discrimination analysis, showing enrichment of Candida albicans and scarcity of Penicillium sp. in cancer patients. Our findings based on ITS sequencing and the combination of ITS1 and ITS2 provide instructive information on the composition and dynamics of gut fungi in our initial test subjects, enhancing our understanding of their roles in gut homeostasis and the microbial shifts associated with cancer. Although our approach was conducted with a limited cohort to establish methodological feasibility, it draws attention to multi-marker strategies, demonstrating that integrated primer datasets surpass traditional single-marker methods in both taxonomic coverage and biomarker detection sensitivity in low-biomass fecal samples. Our research provides a reliable starting point for future studies on the gut mycobiome in both healthy and diseased individuals, which could lead to better diagnostics and treatments based on microbiome profiles.

Citations: 0

Computational analysis of transcriptome data and mapping of functional networks in Parkinson's disease.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2025-11-19 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1690229

Konstantinos Perperidis, Themis P Exarchos, Aristidis G Vrahatis, Panagiotis Vlamos, Marios G Krokidis

Parkinson's disease (PD) is the most common neurodegenerative movement disorder. The pathophysiology is defined by a loss of dopaminergic neurons in the substantia nigra pars compacta; however, recent studies suggest that the peripheral immune system may participate in PD development. Herein, we sought molecular insights by examining RNA-seq data obtained from the peripheral blood of both Parkinson's disease patients and healthy controls. Although all age and gender groups were analyzed, emphasis is placed on individuals aged 50-70, the most prevalent group for Parkinson's diagnosis. The computational workflow comprises both bioinformatics analyses and machine learning processes, and the output of the pipeline includes transcripts ranked by their level of significance, which could serve as reliable genetic signatures. Classification outcomes are also examined with a focus on the significance of selected features, ultimately facilitating the development of gene networks implicated in the disease. The thorough functional analysis of the most prominent genes, regarding their biological relevance to PD, indicates that the proposed framework has strong potential for identifying blood-based biomarkers of the disease. Moreover, this approach facilitates the application of machine learning techniques to RNA-seq data from complex disorders, enabling deeper insights into critical biological processes at the molecular level.

Citations: 0

Completing a molecular timetree of Afrotheria.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2025-11-19 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1710926

Jack M Craig, Whitney L Fisher, Allan S Thomas, S Blair Hedges, Sudhir Kumar

Afrotheria, the superorder that includes aardvarks, elephants, elephant shrews, hyraxes, manatees, and tenrecs, is home to some of the most charismatic and well-studied animals on Earth. Here, we assemble a nearly taxonomically complete molecular timetree of Afrotheria using an integrative approach that combines a literature search for published timetrees, de novo dating of untimed molecular phylogenies, and inference of timetrees from new alignments. The resulting timetree sheds light on the role of the Cretaceous-Paleogene (K-Pg) event ∼66 million years ago in the diversification of Afrotherian orders. The earliest divergence in the timetree of Afrotherian mammals predates the K-Pg event by 12 million years, followed by five interordinal divergences that occurred gradually over a 16-million-year period encompassing the K-Pg event.

Citations: 0

Food-derived linear vs. rationally designed cyclic peptides as potent TNF-alpha inhibitors: an integrative computational study.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2025-11-18 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1716375

Manisha Shah, Sivakumar Arumugam

Introduction: Tumor necrosis factor-alpha (TNF-alpha) is a central mediator of chronic inflammation and a validated therapeutic target in atherosclerosis and related cardiovascular disorders. Peptide therapeutics offer high specificity and low toxicity; however, few natural sequences have been optimized for durable TNF-alpha inhibition.

Methods: A dual in silico strategy was employed to identify potent inhibitors: (i) virtual screening of experimentally validated food-derived bioactive peptides and (ii) rational design of an N-C cyclized, disulfide-bridged peptide based on the TNF-alpha-TNFR1 interface. Molecular docking, 200-ns molecular dynamics simulations, and MM/PBSA free-energy analyses were performed.

Results: The selected peptides exhibited strong and persistent interactions with key TNF-alpha residues, particularly Tyr119. The cyclic analogue demonstrated deeper free-energy minima, higher binding affinity, and more stable hydrogen-bond networks than the linear sequence. ADMET profiling revealed superior metabolic stability, reduced plasma clearance, and no predicted cardiotoxicity.

Discussion: These results indicate that dietary peptides can serve as templates for TNF-alpha inhibition, and interface-guided cyclization rationally enhances stability, binding affinity, and drug-like properties. This study provides a mechanistic framework for developing food-derived peptides as next-generation TNF-alpha antagonists and supports United Nations SDGs 3 and 9 by promoting innovative, low-toxicity therapeutics for chronic inflammation and cardiovascular diseases.

Citations: 0

Role of histone-lysine N-methyltransferase 2D (KMT2D) in MEK-ERK signaling-mediated epigenetic regulation: a phosphoproteomics perspective.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2025-11-18 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1683469

Sreeshma Ravindran Kammarambath, Leona Dcunha, Athira Perunelly Gopalakrishnan, Amal Fahma, Neelam Krishna, Altaf Mahin, Samseera Ummar, Prathik Basthikoppa Shivamurthy, Inamul Hasan Madar, Rajesh Raju

Introduction: Histone-lysine N-methyltransferase 2D (KMT2D) is an H3K4 methyltransferase and a potential tumor suppressor with a crucial role in regulating gene expression. Its dysregulation has been implicated in developmental disorders and several types of cancers. Despite this, the molecular mechanisms that govern its activity remain largely elusive. Among these, post-translational modifications, especially phosphorylation, serve as an essential regulator, fine-tuning KMT2D stability, localization and functional interactions for maintaining cellular homeostasis. With over 173 phosphorylation sites reported, KMT2D is significantly regulated by kinases and exploring its phospho-regulatory network based on targeted in vitro approaches is challenging.

Methods: We systematically curated and integrated the global phosphoproteomic datasets, along with their corresponding experimental conditions, to comprehensively identify the phosphorylation events reported for KMT2D. The site exhibiting the highest frequency of detection across these datasets is considered the predominant phosphorylation site. To investigate its functional significance, we analyzed the proteins and their phosphorylation sites that are differentially co-regulated with the predominant site, as well as its associated upstream kinases and interacting proteins.
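The rule for naming the predominant site (the site detected in the largest share of datasets) amounts to a frequency count, sketched below under stated assumptions. The function name `predominant_site`, the dataset identifiers, and the placeholder site labels `SITE_B` and `SITE_C` are hypothetical; the toy input is not the curated data from the study.

```python
from collections import Counter


def predominant_site(datasets):
    """Return (site, fraction of datasets in which it was detected).

    `datasets` maps a dataset identifier to the phosphosites it reports.
    Each site counts at most once per dataset; ties are broken by site name
    so the result is deterministic (a choice made for this sketch).
    """
    counts = Counter(
        site for sites in datasets.values() for site in set(sites)
    )
    site, n_detected = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return site, n_detected / len(datasets)


# Hypothetical toy input: four datasets and three site labels.
toy = {
    "dataset_1": ["S2274", "SITE_B"],
    "dataset_2": ["S2274"],
    "dataset_3": ["SITE_B", "SITE_C"],
    "dataset_4": ["S2274", "SITE_C", "S2274"],
}
site, fraction = predominant_site(toy)
print(site, fraction)  # → S2274 0.75
```

Counting each site once per dataset keeps a site that is reported repeatedly within one experiment from inflating its detection frequency.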

Results: Among the 173 reported phosphorylation sites representing KMT2D, Serine 2274 (S2274) emerged as the predominant site, detected in over 42% of diverse mass spectrometry-based phosphoproteomics datasets. This site lies within one of KMT2D's unique "LSPPP" motifs, suggesting a potential regulatory role. Detailed investigation of the differentially co-regulated protein phosphosites revealed that the phosphorylation of KMT2D at S2274 is consistently and positively co-regulated with MAPK1/ERK2 activation, as well as with the proteins involved in the MAPK cascade, epigenetic regulation, and cell differentiation. Notably, ERK2 was predicted as an upstream kinase targeting S2274, suggesting that KMT2D S2274 functions as a potential downstream effector of the MEK-ERK signaling pathway, potentially linking it to epigenetic regulation and cell differentiation. Further, our results highlighted a potential mechanistic link between disrupted phosphorylation at S2274 and the pathogenesis of Kabuki syndrome.

Discussion: This study delineates the phosphoregulatory network of KMT2D, positioning it as a dynamic epigenetic effector modulated by MEK-ERK signaling, with broader implications for cancer and developmental disorders.



Citations: 0

Machine learning-guided optimization of triple agonist peptide therapeutics for metabolic disease.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2025-11-17 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1687617

Anthony Wong, Sanskruthi Guduri, TsungYen Chen, Kunal Patel

Introduction: Multi-target peptide therapeutics targeting glucagon receptor (GCGR), glucagon-like peptide-1 receptor (GLP1R), and glucose-dependent insulinotropic polypeptide receptor (GIPR) represent a promising approach for treating diabetes and obesity. Triple agonist peptides demonstrate promising therapeutic potential compared to single-target approaches, yet rational design remains computationally challenging due to complex sequence-structure-activity relationships. Existing methods, primarily based on convolutional neural networks, impose limitations including fixed sequence lengths and inadequate representation of molecular topology. Graph Attention Networks (GAT) offer advantages in capturing molecular structures and variable-length peptide sequences while providing interpretable insights into receptor-specific binding determinants.

Methods: A dataset of 234 peptide sequences with experimentally determined binding affinities was compiled from multiple sources. Peptides were represented as molecular graphs with seven-dimensional node features encoding physicochemical properties and positional information. The GAT architecture employed a shared encoder with task-specific prediction heads, implementing transfer learning to address limited GIPR training data. Performance was evaluated using 5-fold cross-validation and independent validation on 24 literature-derived sequences. A genetic algorithm framework was developed for peptide sequence optimization, incorporating multi-objective fitness evaluation based on predicted binding affinity, biological plausibility, and sequence novelty.
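One way to picture a multi-objective fitness of the kind named in the Methods is a weighted sum of the three criteria. The sketch below is only an illustration under stated assumptions: the abstract does not say how the objectives are combined, so the weights, the identity-based novelty measure, and the function names `novelty` and `fitness` are all hypothetical.

```python
def novelty(candidate, known):
    """1 minus the highest position-wise identity to any known sequence.

    `known` must be non-empty. Identity is computed over aligned positions
    and divided by the longer length (an assumed, alignment-free measure).
    """
    def identity(a, b):
        matches = sum(x == y for x, y in zip(a, b))
        return matches / max(len(a), len(b))

    return 1.0 - max(identity(candidate, k) for k in known)


def fitness(binding_probs, plausibility, novelty_score, weights=(0.6, 0.2, 0.2)):
    """Weighted sum of mean predicted binding, plausibility and novelty.

    All inputs are expected in [0, 1]; the default weights are assumptions,
    not values reported by the authors.
    """
    w_bind, w_plaus, w_nov = weights
    mean_binding = sum(binding_probs) / len(binding_probs)
    return w_bind * mean_binding + w_plaus * plausibility + w_nov * novelty_score


# Toy strings standing in for peptide sequences (not real agonists).
score = fitness(
    binding_probs=[0.6, 0.5, 0.7],  # hypothetical GCGR, GLP1R, GIPR predictions
    plausibility=0.9,
    novelty_score=novelty("HAEGS", ["HAEGT", "YAEGT"]),
)
print(round(score, 3))
```

A scalarised sum is the simplest choice; a Pareto-ranking scheme would be an equally valid reading of "multi-objective" and would avoid having to pick weights.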

Results: Cross-validation demonstrated robust GAT performance across all receptors, with GCGR achieving high accuracy (AUC-ROC: 0.915 ± 0.050), followed by GLP1R (AUC-ROC: 0.853 ± 0.059), and GIPR showing acceptable performance despite limited data (AUC-ROC: 0.907 ± 0.083). Comparative analysis revealed receptor-specific advantages: GAT significantly outperformed CNN for GCGR prediction (RMSE: 0.942 vs. 1.209, p = 0.0013), while CNN maintained superior GLP1R performance (RMSE: 0.552 vs. 0.723). Genetic algorithm optimization yielded a measurable improvement over baseline, with a 4.0% fitness enhancement and generation of 20 candidates exhibiting mean binding probabilities exceeding 0.5 across all targets. The GAT-based framework provides a computational approach to peptide design, demonstrating receptor-specific advantages and robust optimization capabilities.

Conclusion: Genetic algorithm optimization enables systematic exploration of sequence space within existing agonist scaffolds while maintaining biological constraints. This approach provides a rational framework for prioritizing experimental validation efforts in triple agonist development.



Citations: 0

QSAR-guided discovery of novel KRAS inhibitors for lung cancer therapy.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2025-11-17 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1663846

Osasan Stephen Adebayo, George Oche Ambrose, Daramola Olusola, Adefolalu Oluwafemi, Hind A Alzahrani, Abdulkarim Hasan

Introduction: KRAS mutations are key oncogenic drivers in lung cancer, yet effective pharmacological targeting has remained a major challenge due to the protein's elusive and dynamic binding pockets. Computational modeling offers a promising route to identify novel inhibitors with improved potency and selectivity.

Methods: A quantitative structure-activity relationship (QSAR) modeling approach was developed to predict the inhibitory potency (pIC₅₀) of KRAS inhibitors and support de novo drug design. Molecular descriptors for 62 inhibitors retrieved from the ChEMBL database (CHEMBL4354832) were computed using Chemopy. Following descriptor normalization and dimensionality reduction, five machine learning algorithms were applied: partial least squares (PLS), random forest (RF), stepwise multiple linear regression (MLR), genetic algorithm-optimized MLR (GA-MLR), and XGBoost. Model performance was evaluated using R², RMSE, and MAE, while permutation-based importance and SHAP analyses provided feature interpretability.
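For readers unfamiliar with the three evaluation metrics named in the Methods, the sketch below computes them from their standard definitions, together with the usual conversion pIC₅₀ = −log₁₀(IC₅₀ in mol/L). The function names and the toy numbers are assumptions for illustration and are not taken from the study.

```python
import math


def pic50_from_ic50_nm(ic50_nm):
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE and MAE for paired observed and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Toy pIC50 values, invented for the example.
print(regression_metrics([6.0, 7.0, 8.0], [6.0, 7.0, 9.0]))
print(pic50_from_ic50_nm(10.0))
```

Because RMSE squares the residuals before averaging, it penalises a few large errors more than MAE does, which is why the two are often reported together.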

Results: Among the models tested, PLS exhibited the best predictive performance (R² = 0.851; RMSE = 0.292), followed by RF (R² = 0.796). The GA-MLR model, based on eight optimized molecular descriptors, achieved good interpretability and robust internal validation (R² = 0.677). Virtual screening of 56 de novo designed compounds within the model's applicability domain identified compound C9, with a predicted pIC₅₀ of 8.11, as the most promising hit.

Discussion: This integrative QSAR modeling and de novo design framework effectively predicted the bioactivity of KRAS inhibitors and facilitated the identification of novel candidate molecules. The findings demonstrate the utility of combining interpretable machine learning models with virtual screening to accelerate the discovery of potent KRAS inhibitors for lung cancer therapy.



Citations: 0

Comprehensive analysis of multi-omics vaccine response data using MOFA and Stabl algorithms.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2025-11-13 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1636240

Aanya Gupta, Koji Abe, Holden T Maecker

Introduction: FluPRINT is a multi-omics dataset that measures donors' protein expression and cell counts across various assays. Donors were also assigned a binary label: high responders (1) if the antibody titer for hemagglutination inhibition (HAI) showed a fold change ≥4 from day 0 to day 28, and low responders (0) otherwise. In this project, we used the MOFA and Stabl algorithms to analyze FluPRINT, estimate the population structure from the data, and identify the most important features for predicting response to the vaccine.
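The responder definition above amounts to a simple threshold on the day 28 to day 0 titer ratio. A minimal sketch, assuming one HAI titer per donor at each time point (the function name and argument layout are hypothetical):

```python
def responder_label(titer_day0, titer_day28, threshold=4.0):
    """Return 1 (high responder) if the HAI titer fold change is >= threshold, else 0."""
    if titer_day0 <= 0:
        raise ValueError("day 0 titer must be positive to compute a fold change")
    fold_change = titer_day28 / titer_day0
    return 1 if fold_change >= threshold else 0


# Toy titers: a 4-fold rise is a high responder, a 2-fold rise is not.
print(responder_label(10, 40), responder_label(10, 20))
```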

Methods: The preprocessing of the dataset included removing repeat features, scaling by assay, and removing outliers. Since Stabl does not directly address missing values, features with a high proportion of missing values were removed and the remaining missing values were ignored.
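The missing-value filter mentioned in the Methods could look like the sketch below. The abstract does not state the cutoff, so the `max_missing=0.5` default, the dictionary layout, and the feature names are assumptions for illustration only.

```python
def drop_sparse_features(table, max_missing=0.5):
    """Keep only features whose fraction of missing values is <= max_missing.

    `table` maps a feature name to its per-donor values, with None marking
    a missing measurement (hypothetical layout).
    """
    kept = {}
    for name, values in table.items():
        missing_fraction = sum(v is None for v in values) / len(values)
        if missing_fraction <= max_missing:
            kept[name] = values
    return kept


# Toy table: "feature_b" is 75% missing and is dropped.
toy = {
    "feature_a": [1.2, None, 0.7, 3.1],
    "feature_b": [None, None, None, 0.4],
}
print(list(drop_sparse_features(toy)))
```

Features that survive the filter can still contain a few missing entries, which corresponds to the "remaining missing values were ignored" step.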

Results: MOFA identified the top feature in structure extraction as IL neg 2 CD4 pos CD45Ra neg pSTAT5. MOFA explained the variance of the data well while also selecting features with good significance, as illustrated by their p-values (p < 0.05). Stabl found the top feature for explaining the outcome to be CD33⁻ CD3⁺ CD4⁺ CD25hi CD127low CD161⁺ CD45RA⁺ Tregs, which matched the top result of a previously published analysis. MOFA's features achieved an AUROC of 0.616 (95% CI: 0.426-0.806), and Stabl's achieved an AUROC of 0.634 (95% CI: 0.432-0.823).

Discussion: Our research addresses a key knowledge gap: understanding how these fundamentally different analytical approaches perform when analyzing the same complex dataset. Our exploration evaluates their respective strengths, limitations, and biological insights and provides guidance on using MOFA and Stabl to find the best predictive cell subsets and features for understanding large immunological multi-omics data. The code for this project can be found at https://github.com/aanya21gupta/fluprint.



Citations: 0