
Frontiers in bioinformatics: latest articles

CoMPHI: a novel composite machine learning approach utilizing multiple feature representation to predict hosts of bacteriophages.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-16 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1622931
Shreyashi Bodaka, Narasaiah Kolliputi

Phage therapy has reemerged as a compelling alternative to antibiotics in treating bacterial infections, especially for superbugs that have developed antibiotic resistance. The challenge in the broader application of phage therapy is identifying host targets for the vast array of uncharacterized phages obtained through next-generation sequencing. We introduce a Composite Model for Phage Host Interaction (CoMPHI) that integrates alignment-based approaches with machine learning. The model generates multiple feature encodings from nucleotide and protein sequences of both phages and hosts. It incorporates alignment scores between phage-phage, phage-host, and host-host pairs, creating a composite prediction framework. During 5-fold cross-validation, CoMPHI achieved Area Under the ROC Curve (AUC-ROC) values of 94-96.7% and accuracies of 92.3-95.1% across taxonomic levels from species to phylum. Comparative analysis showed a 6-8% performance improvement when alignment scores were included. Ablation studies demonstrated that combining nucleotide and protein encodings, along with phage-host, host-host, and phage-phage alignment scores, significantly enhanced prediction accuracy. CoMPHI provides a robust and comprehensive framework for predicting phage-host interactions. By combining sequence features and alignment information, the model advances computational tools that can accelerate the application of phage therapy in modern medicine.
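The composite idea in this abstract (sequence encodings for phage and host, concatenated with alignment scores) can be illustrated with a minimal sketch. The abstract does not say which encodings CoMPHI uses, so the normalized k-mer frequencies below are an assumption, and the function names, the toy sequences, and the alignment score of 0.42 are all hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed, reproducible order.

    Assumption: k-mer frequencies stand in for the (unspecified) encodings.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = {kmer: 0 for kmer in kmers}
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous symbols such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


def composite_features(phage_seq, host_seq, alignment_score, k=2):
    """Concatenate phage encoding, host encoding, and one alignment score."""
    return (
        kmer_frequencies(phage_seq, k)
        + kmer_frequencies(host_seq, k)
        + [float(alignment_score)]
    )


# Hypothetical toy inputs; a real pipeline would use full genomes and
# scores from an alignment tool.
features = composite_features("ACGTACGT", "GGCCGGCC", alignment_score=0.42)
print(len(features))  # 16 phage dinucleotides + 16 host dinucleotides + 1 score
```

A vector like this could then be passed to any classifier; which classifier CoMPHI uses is not stated in the abstract.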

Citations: 0
Multimodal knowledge expansion widget powered by plant protein phosphorylation database and ChatGPT.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-15 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1687687
Chunhui Xu, Yang Yu, Govardhan Khadakkar, Jiacheng Xie, Dong Xu, Qiuming Yao

Biological databases are essential for providing curated knowledge, but their rigid data structures and restrictive query formats often limit flexible and exploratory user interactions. In the field of plant phosphorylation, manually curated and reviewed data represent only a small portion of the available knowledge, and users often seek information that goes beyond what is provided in structured databases. While large language models (LLMs) like ChatGPT-4o possess extensive contextual knowledge, integrating this capability into bioinformatics tools remains an open challenge. Here, we present a multimodal question-answering widget that integrates ChatGPT-4o with our Plant Protein Phosphorylation Database (P3DB). This system supports natural language queries and dynamic prompt formulation, enabling users to explore phosphorylation events, kinase-substrate relationships, and protein-protein interactions through a global entry. In another application, the widget leverages ChatGPT's image interpretation functionality to extract regulatory pathways and phosphorylation markers from complex scientific figures. To build this widget effectively, we have explored multiple prompt strategies, including one-step, two-step, few-shot, and image-cropping techniques, demonstrating their impact on output accuracy and consistency. In addition, recent multimodal LLMs such as ChatGPT-5 and Gemini 1.5 have demonstrated comparable capabilities and adaptability when applied to our test cases and the developed widgets. Together, our application widget and results highlight the development of the ChatGPT-P3DB integration as a system that enhances user accessibility, enables visual extraction, and extends the current utility of biological knowledgebases through a flexible and adaptive framework. Our "ChatGPT-P3DB" is open-source and can be accessed on GitHub (https://github.com/yao-laboratory/p3db-chat). 
The frontend interface, "P3DB askAI" web module, can be accessed freely through https://www.p3db.org/ask-ai.

Citations: 0
Analysis of breast region segmentation in thermal images using U-Net deep neural network variants.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-10 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1609004
Rafhanah Shazwani Rosli, Mohamed Hadi Habaebi, Md Rafiqul Islam, Mohammed Abdulla Salim Al Hussaini

Introduction: Breast cancer detection using thermal imaging relies on accurate segmentation of the breast region from adjacent body areas. Reliable segmentation is essential to improve the effectiveness of computer-aided diagnosis systems.

Methods: This study evaluated three segmentation models (U-Net, U-Net with Spatial Attention, and U-Net++), each trained with five optimization algorithms (ADAM, NADAM, RMSPROP, SGDM, and ADADELTA). Performance was assessed through k-fold cross-validation using Intersection over Union (IoU), Dice coefficient, precision, recall (sensitivity), specificity, pixel accuracy, ROC-AUC, and PR-AUC, with Grad-CAM heatmaps for qualitative analysis.
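As an illustration of the two overlap metrics named above, the following minimal sketch computes IoU and the Dice coefficient for a pair of binary masks. The function name and the tiny example masks are hypothetical, and treating two empty masks as a perfect match is a convention chosen here, not something stated in the abstract.

```python
def overlap_metrics(pred, truth):
    """Return (IoU, Dice) for two binary masks given as nested lists of 0/1."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1  # pixel marked as breast region in both masks
            elif p:
                fp += 1  # predicted only
            elif t:
                fn += 1  # ground truth only
    union = tp + fp + fn
    if union == 0:
        return 1.0, 1.0  # convention: two empty masks agree perfectly
    iou = tp / union
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice


# Hypothetical 2x3 masks: 2 overlapping pixels, 1 false positive, 1 false negative.
predicted = [[1, 1, 0], [0, 1, 0]]
reference = [[1, 0, 0], [0, 1, 1]]
iou, dice = overlap_metrics(predicted, reference)
print(iou, round(dice, 4))
```

The two metrics are monotonically related by Dice = 2 * IoU / (1 + IoU), so they rank models identically on a single image; they can differ once scores are averaged over many images.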

Results: The ADAM optimizer consistently outperformed the others, yielding superior accuracy and reduced loss. Among the models, the baseline U-Net, despite being less complex, demonstrated the most effective performance, with precision of 0.9721, recall of 0.9559, specificity of 0.9801, ROC-AUC of 0.9680, and PR-AUC of 0.9472. U-Net also achieved higher robustness in breast region overlap and noise handling compared to its more complex variants. The findings indicate that greater architectural complexity does not necessarily lead to improved outcomes.

Discussion: This research highlights that the original U-Net, when trained with the ADAM optimizer, remains highly effective for breast region segmentation in thermal images. The insights contribute to guiding the selection of suitable deep learning models and optimizers for medical image analysis, with the potential to enhance the efficiency and accuracy of breast cancer diagnosis using thermal imaging.

Citations: 0
Unveiling the impact of interferon genes on the immune microenvironment of triple-negative breast cancer: identification of therapeutic targets.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-08 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1629526
Ying Liu, Jiayi Cai, Aamir Fahira, Kai Zhuang, Jiaojiao Wang, Zhi Zhang, Lin Yan, Yong Liu, Defang Ouyang, Zunnan Huang

Objective: Triple-negative breast cancer (TNBC), a classic subtype of breast cancer, is challenging to treat due to the lack of drug-targeting receptors. This study aims to explore interferon-related prognostic molecular biomarkers in TNBC and their potential competing endogenous RNA (ceRNA) regulatory network.

Methods: RNA expression profiles and interferon genes were downloaded from the Cancer Genome Atlas (TCGA) database and the Gene Set Enrichment Analysis (GSEA) website, respectively. Univariate and multivariate Cox regression analyses were performed to identify prognostic genes and construct a risk model. Single-sample GSEA (ssGSEA) and the CellMiner database were used to explore the relationships between prognostic genes and the tumor immune microenvironment and drug sensitivity, respectively. The lncRNA-miRNA-mRNA network associated with prognosis was constructed using the ENCORI database. Finally, the potential interferon-associated lncRNA/miRNA/mRNA regulatory axis was identified through correlation analysis. Aberrant expression of the prognostic genes was validated in three TNBC tumor cell lines, compared with normal mammary epithelial cells, using quantitative real-time polymerase chain reaction (qRT-PCR).
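A Cox-based risk model of the kind described here typically scores each sample as a weighted sum of gene expression values, with the weights taken from the multivariate Cox coefficients. The sketch below shows that arithmetic only. The coefficients, expression values, and sample names are hypothetical placeholders, not values from the study, and the median split into high-risk and low-risk groups is an assumed (though common) convention that the abstract does not confirm.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Linear predictor: sum of coefficient * expression over signature genes."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


def stratify(scores):
    """Label samples strictly above the median score as high risk (assumed rule)."""
    cutoff = median(scores.values())
    return {name: ("high" if value > cutoff else "low") for name, value in scores.items()}


# Hypothetical coefficients for the four signature genes named in the abstract.
coefficients = {"STXBP1": 0.5, "LAMP3": -0.25, "CD276": 0.5, "POLR2F": 0.25}

# Hypothetical expression values for three made-up samples.
samples = {
    "sample_1": {"STXBP1": 2.0, "LAMP3": 4.0, "CD276": 2.0, "POLR2F": 4.0},
    "sample_2": {"STXBP1": 4.0, "LAMP3": 0.0, "CD276": 4.0, "POLR2F": 0.0},
    "sample_3": {"STXBP1": 0.0, "LAMP3": 4.0, "CD276": 0.0, "POLR2F": 0.0},
}

scores = {name: risk_score(expr, coefficients) for name, expr in samples.items()}
groups = stratify(scores)
print(scores)
print(groups)
```

In practice the coefficients would come from fitting a Cox proportional hazards model to survival data, which this sketch deliberately leaves out.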

Results: A TNBC prognostic signature comprising four interferon genes (STXBP1, LAMP3, CD276, and POLR2F) was identified. Their expression correlated significantly with the infiltration abundance of multiple immune cell types and with sensitivity to 30 drugs (including ARQ-680, fluphenazine, and chelerythrine). An interferon-related prognostic ceRNA network was then constructed, consisting of 248 lncRNAs, 66 miRNAs, and 4 mRNAs. From this network, 5 interferon-related ceRNA regulatory axes associated with TNBC progression were identified (AC124067.4/hsa-miR-455-3p/STXBP1, RBPMS-AS1/hsa-miR-455-3p/STXBP1, DNMBP-AS1/hsa-miR-455-3p/STXBP1, FAM198B-AS1/hsa-miR-455-3p/STXBP1, LIFR-AS1/hsa-miR-455-3p/STXBP1). qRT-PCR results showed that all four prognostic mRNAs were upregulated in TNBC cells.

Conclusion: This study established a prognostic signature and a ceRNA network associated with interferon in TNBC, and identified five key regulatory axes. Within the prognostic signature and the ceRNA axes, STXBP1, RBPMS-AS1, and FAM198B-AS1 are reported for the first time as potential biomarkers of TNBC. These findings have the potential to provide new insights into the mechanisms driving TNBC tumorigenesis and development.

Citations: 0
Optimizing clustering of CDR3 sequences using natural language processing, Word2Vec, and KMeans.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-10-02 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1623488
Sanskriti Baranwal, Ricardo Avila Sanchez, Clement-Andi Edet, Erick Chastain, Inimary Toby

T-cell receptor (TCR) sequencing has emerged as a powerful tool for understanding adaptive immune responses, yet challenges persist in deciphering the immense diversity of Complementarity-Determining Region 3 (CDR3) sequences. This study presents a novel natural language processing (NLP)-based pipeline to cluster CDR3 sequences from TCR β-chain repertoires using Word2Vec embeddings, principal component analysis (PCA), and KMeans clustering. Focusing on Acute Respiratory Distress Syndrome (ARDS), a life-threatening inflammatory lung condition, we trained Word2Vec models on healthy controls and applied unsupervised clustering across ARDS, non-ARDS, and control datasets. Dimensionality-reduced embeddings revealed clear distinctions in repertoire structure: control samples exhibited tight, low-diversity clusters; ARDS patients showed high dispersion and numerous diffuse clusters indicative of repertoire disruption; and non-ARDS samples displayed intermediate organization. These differences suggest that immune activation states are embedded in the structural topology of the CDR3 space. Our framework successfully captured these latent patterns, offering a scalable approach to biomarker discovery. This study not only reinforces the utility of NLP in immunological analysis but also paves the way for data-driven immune monitoring in critical care and personalized diagnostics.
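Treating CDR3 sequences as text requires a tokenization step before any Word2Vec model can be trained. The abstract does not describe how the sequences are split into "words", so the overlapping 3-mers below are an assumption, and the function names and example sequences are hypothetical.

```python
def tokenize_cdr3(sequence, k=3):
    """Split an amino-acid sequence into overlapping k-mers ("words").

    Assumption: overlapping k-mers; the study's actual tokenization is not
    given in the abstract.
    """
    sequence = sequence.strip().upper()
    if len(sequence) < k:
        return [sequence] if sequence else []
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_corpus(sequences, k=3):
    """Turn a list of CDR3 sequences into a list of token lists ("sentences")."""
    return [tokenize_cdr3(seq, k) for seq in sequences if seq.strip()]


# Hypothetical CDR3 beta-chain sequences used only for illustration.
corpus = build_corpus(["CASSLGQGAYEQYF", "CASSPTGGYEQYF"])
print(corpus[0])
```

A corpus in this shape (a list of token lists) is the usual input for Word2Vec implementations; the resulting per-token vectors would then need to be pooled per sequence, for example by averaging, before PCA and KMeans are applied.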

Citations: 0
EPheClass: ensemble-based phenotype classifier from 16S rRNA gene sequences.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-09-30 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1514880
Lara Vázquez-González, Carlos Peña-Reyes, Alba Regueira-Iglesias, Carlos Balsa-Castro, Inmaculada Tomás, María J Carreira

One area of bioinformatics that is currently attracting particular interest is the classification of polymicrobial diseases using machine learning (ML), with data obtained from high-throughput amplicon sequencing of the 16S rRNA gene in human microbiome samples. The microbial dysbiosis underlying these types of diseases is particularly challenging to classify, as the data are high-dimensional, with potentially hundreds or even thousands of predictive features. In addition, the imbalance in the composition of the microbial community is highly heterogeneous across samples. In this paper, we propose a curated pipeline for binary phenotype classification based on a count table of 16S rRNA gene amplicons, which can be applied to any microbiome. To evaluate our proposal, raw 16S rRNA gene sequences from samples of healthy and periodontally affected oral microbiomes that met certain quality criteria were downloaded from public repositories. In the end, a total of 2,581 samples were analysed. In our approach, we first reduced the dimensionality of the data using feature selection methods. After tuning and evaluating different ML models and ensembles created using Dynamic Ensemble Selection (DES) techniques, we found that all DES models performed similarly and were more robust than individual models. Although the margin over other methods was minimal, DES-P achieved the highest AUC and was therefore selected as the representative technique in our analysis. When diagnosing periodontal disease from saliva samples, it achieved, with only 13 features, an F1 score of 0.913, a precision of 0.881, a recall (sensitivity) of 0.947, an accuracy of 0.929, and an AUC of 0.973. In addition, we used EPheClass to diagnose inflammatory bowel disease (IBD) and obtained better results than other works in the literature using the same dataset. We also evaluated its effectiveness in detecting antibiotic exposure, where it again demonstrated competitive results. This highlights the generalisability of our classification approach, which is applicable to different phenotypes, study niches, and sample types. The code is available at https://gitlab.citius.usc.es/lara.vazquez/epheclass.
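The reported F1 score can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below does that arithmetic; the function name is ours, and only the three figures quoted in the abstract are used.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the saliva-based periodontal model.
f1 = f1_from_precision_recall(0.881, 0.947)
print(round(f1, 3))  # 0.913, matching the reported F1 score
```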

Citations: 0
Discovering molecules and plants with potential activity against gastric cancer: an in silico ensemble-based modeling analysis.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2025-09-30 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1642039
Micaela Villacrés, Alec Avila, Karina Jimenes-Vargas, António Machado, José M Alvarez-Suarez, Eduardo Tejera

Background: Gastric cancer (GC) remains a major global health burden despite advances in diagnosis and treatment. In recent years, natural products have gained increasing attention as promising sources of anticancer agents, including agents against GC.

Methods: In this study, we applied an in silico ensemble-based modeling strategy to predict compounds with potential inhibitory effects against four GC-related cell lines: AGS, NCI-N87, BGC-823, and SNU-16. Individual predictive models were developed using several algorithms and further integrated into two consensus ensemble multi-objective models. A comprehensive database of over 100,000 natural compounds from 21,665 plant species was screened for validation and to identify potential molecular candidates.

Results: The ensemble models demonstrated a 12- to 15-fold improvement in identifying active molecules compared to random selection. A total of 340 molecules were prioritized, many belonging to bioactive classes such as taxane diterpenoids, flavonoids, isoflavonoids, phloroglucinols, and tryptophan alkaloids. Known anticancer compounds, including paclitaxel, orsaponin (OSW-1), glycybenzofuran, and glyurallin A, were successfully retrieved, reinforcing the validity of the approach. Species from the genera Taxus, Glycyrrhiza, Elaphoglossum, and Seseli emerged as particularly relevant sources of bioactive candidates.

Conclusion: While some genera, such as Taxus and Glycyrrhiza, have well-documented anticancer properties, others, including Elaphoglossum and Seseli, require further experimental validation. These findings highlight the potential of combining multi-objective ensemble modeling with natural product databases to discover novel phytochemicals relevant to GC treatment.
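An improvement "compared to random selection" is commonly expressed as an enrichment factor: the fraction of actives among the compounds a model selects, divided by the fraction of actives in the whole library. The abstract gives neither the formula nor the underlying counts, so the sketch below assumes this standard definition and uses made-up numbers purely for illustration.

```python
def enrichment_factor(actives_selected: int, n_selected: int,
                      actives_total: int, n_total: int) -> float:
    """Hit rate among selected compounds divided by the library-wide hit rate."""
    if n_selected <= 0 or n_total <= 0 or actives_total <= 0:
        raise ValueError("counts must be positive")
    hit_rate_selected = actives_selected / n_selected
    hit_rate_random = actives_total / n_total  # expected rate under random picking
    return hit_rate_selected / hit_rate_random


# Hypothetical numbers (NOT from the study): a library of 10,000 compounds with
# 100 actives (1%), and a model whose top 200 picks contain 26 actives (13%).
ef = enrichment_factor(26, 200, 100, 10_000)
print(f"{ef:.1f}")  # 13.0
```

Under these assumed counts the model finds actives 13 times more often than random picking would, which is the kind of figure a "12- to 15-fold improvement" describes.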

Citations: 0
Integrative machine learning and transcriptomic analysis identifies key molecular targets in MNPN-associated oral squamous cell carcinoma pathogenesis.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2025-09-25 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1664576
Xiangjun Wang, Panpan Jin, Juan Xu, Junyi Li, Mengzhen Ji

Background: Oral squamous cell carcinoma (OSCC) represents a significant global health challenge, with betel nut consumption being a major risk factor. 3-(methylnitrosamino)propionitrile (MNPN), a betel nut-derived nitrosamine, has been identified as a potential carcinogen, but its molecular targets in OSCC pathogenesis remain poorly understood.

Methods: We employed a comprehensive computational framework integrating target prediction, transcriptomic analysis, weighted gene co-expression network analysis (WGCNA), and machine learning approaches. Four OSCC datasets from Gene Expression Omnibus (GEO) were analyzed, and MNPN targets were predicted using ChEMBL, PharmMapper, and SwissTargetPrediction databases. Machine learning algorithms (n = 127 combinations) were evaluated for optimal biomarker identification, with model interpretability assessed using SHAP (SHapley Additive exPlanations) analysis.

Results: Target prediction identified 881 potential MNPN targets across three databases. WGCNA revealed 534 OSCC-associated differentially expressed genes, with 38 overlapping MNPN targets. Machine learning optimization identified 13 hub genes, with PLAU demonstrating the highest predictive performance (AUC = 0.944). SHAP analysis confirmed PLAU and PLOD3 as the most influential contributors to disease prediction. Functional enrichment analysis revealed MNPN targets' involvement in xenobiotic response, hypoxic conditions, and aberrant tissue remodeling.

Conclusion: This study provides the first comprehensive molecular characterization of MNPN-associated OSCC pathogenesis, identifying PLAU as a critical therapeutic target with exceptional diagnostic potential. Our findings establish a foundation for developing targeted interventions for betel nut nitrosamine-associated oral cancers and demonstrate the power of integrative computational approaches in environmental carcinogen research.
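A single-gene AUC such as the one reported for PLAU can be read as the probability that a randomly chosen tumor sample has a higher expression value than a randomly chosen control sample, with ties counting as one half. The abstract does not say how the AUC was computed, so the sketch below assumes this rank-based definition; the expression values are hypothetical and are not taken from the GEO datasets.

```python
from typing import Sequence


def auc_from_scores(positives: Sequence[float], negatives: Sequence[float]) -> float:
    """Rank-based AUC: share of (positive, negative) pairs ranked correctly."""
    if not positives or not negatives:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # pair ordered correctly
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical expression values for one gene (illustrative only).
tumor = [5.1, 6.3, 4.8, 7.0]
normal = [3.2, 4.9, 2.7, 3.9]
print(auc_from_scores(tumor, normal))  # 0.9375
```

Here 15 of the 16 tumor/normal pairs are ordered correctly, giving an AUC of 0.9375. The pairwise loop is quadratic, which is fine for illustration but slow for large cohorts.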

Citations: 0
Computational drug repurposing reveals Alectinib as a potential lead targeting Cathepsin S for therapeutic developments against cancer and chronic pain.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2025-09-24 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1666573
Mohammed Alrouji, Mohammed S Alshammari, Sharif Alhajlah, Syed Tasqeeruddin, Khuzin Dinislam, Anas Shamsi, Saleha Anwar

Cathepsin S (CathS) is a cysteine protease known to play a role in extracellular matrix (ECM) remodelling, antigen presentation, immune cell polarisation, cancer progression, and chronic pain pathophysiology. CathS also promotes an immunosuppressive environment in solid tumors and is involved in nociceptive signaling. Although several small-molecule inhibitors with favorable in vivo properties have been developed, their clinical utility is limited by resistance, off-target effects, and suboptimal efficacy. Therefore, alternative therapeutic strategies are urgently needed. In the present study, we used an integrated virtual screening protocol to screen 3,500 commercially available FDA-approved drug molecules from DrugBank against the CathS crystal structure, and then filtered putative candidates by drug-likeness profiling and interaction analysis. Alectinib emerged as a top hit and showed significant interactions with the important active-site residues His278 and Cys139. PASS predictions suggested relevant anticancer and anti-pain activities for Alectinib relative to the control inhibitor Q1N. Subsequent 500-ns molecular dynamics simulations with the CHARMM36 force field indicated that the CathS-Alectinib complex maintained its structural stability, as shown by conformational parameters, hydrogen-bond persistence, and essential dynamics analyses. MM-PBSA calculations further confirmed a favorable binding free energy (ΔG = -20.16 ± 2.59 kcal/mol) dominated by van der Waals and electrostatic contributions. These computational findings suggest that Alectinib may have potential as a repurposed CathS inhibitor, warranting further experimental testing in relevant cancer and chronic pain models. Notably, these results are based solely on computational analysis and require empirical validation.

Citations: 0
Extracting a COVID-19 signature from a multi-omic dataset.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2025-09-22 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1645785
Baptiste Bauvin, Thibaud Godon, Guillaume Bachelot, Claudia Carpentier, Riikka Huusaari, Maxime Deraspe, Juho Rousu, Caroline Quach, Jacques Corbeil

Introduction: The complexity of COVID-19 requires approaches that extend beyond symptom-based descriptors. Multi-omic data, combining clinical, proteomic, and metabolomic information, offer a more detailed view of disease mechanisms and biomarker discovery.

Methods: As part of a large-scale Quebec initiative, we collected extensive datasets from COVID-19 positive and negative patient samples. Using a multi-view machine learning framework with ensemble methods, we integrated thousands of features across clinical, proteomic, and metabolomic domains to classify COVID-19 status. We further applied a novel feature relevance methodology to identify condensed signatures.

Results: Our models achieved a balanced accuracy of 89% ± 5% despite the high-dimensional nature of the data. Feature selection yielded 12- and 50-feature signatures that improved classification accuracy by at least 3% compared to the full feature set. These signatures were both accurate and interpretable.

Discussion: This work demonstrates that multi-omic integration, combined with advanced machine learning, enables the extraction of robust COVID-19 signatures from complex datasets. The condensed biomarker sets provide a practical path toward improved diagnosis and precision medicine, representing a significant advancement in COVID-19 biomarker discovery.
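Balanced accuracy, the metric reported in the Results, is the mean of sensitivity and specificity, which keeps the larger class from dominating the score when positive and negative samples are unevenly represented. A minimal sketch follows; the confusion-matrix counts are hypothetical and are not taken from the study.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("each class needs at least one sample")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical counts (illustrative only): 90 of 100 positive samples and
# 176 of 200 negative samples classified correctly.
score = balanced_accuracy(tp=90, fn=10, tn=176, fp=24)
print(f"{score:.2f}")  # 0.89
```

With these assumed counts, sensitivity is 0.90 and specificity is 0.88, so the balanced accuracy is 0.89 even though the negative class is twice as large.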

Citations: 0