
BMC Bioinformatics: latest articles

MCLCBA: multi-view contrastive learning network for RNA methylation site prediction.
IF 3.3, Zone 3 (Biology), Q2 Biochemical Research Methods. Pub Date: 2025-11-19. DOI: 10.1186/s12859-025-06306-x
Honglei Wang, Xuesong Zhang, Yanjing Sun, Zhaoyang Liu, Lin Zhang

Background: RNA methylation (RM) regulates gene expression, RNA stability, and protein translation. Accurate prediction of RM modification sites is essential for understanding their biological functions, but wet-lab detection techniques are operationally complex and costly. Deep learning (DL) methods have been applied to this task, yet their performance degrades when training data are limited. The Bidirectional Gated Recurrent Unit (BGRU) shows substantial degradation on smaller training sets. The Convolutional Neural Network (CNN) can extract local pattern features but learns overly specific patterns from sample-limited data, resulting in poor feature generalization. Bidirectional Long Short-Term Memory (BiLSTM) excels at modeling long-range dependencies but, with limited samples, cannot learn its gating parameters well enough to capture effective sequence representations. The Transformer processes sequences in parallel and captures global dependencies through self-attention, but its quadratic computational complexity and large parameter count make it prone to overfitting on small datasets.

Results: This study proposes a Multi-view Contrastive Learning with CNN-BiLSTM-Attention (MCLCBA) framework for RM modification site prediction. The multi-view approach comprises a primary view and an auxiliary view: the primary view uses DNA Bidirectional Encoder Representations from Transformers (DNABERT) to extract contextual sequence features, and the auxiliary view uses Chaos Game Representation (CGR) to extract structural features. Feature extraction has four components: data augmentation, multi-view encoders, projection heads, and contrastive loss functions. Using dual differential data augmentation strategies and multi-view network architectures for feature processing and fusion, the model learns discriminative feature representations that are invariant to data augmentation by maximizing positive sample similarity while minimizing negative sample similarity. This addresses feature learning in sample-limited scenarios. Experimental results on the sample-limited m7G dataset demonstrate that MCLCBA achieves AUROC and AUPRC of 85.64% and 86.94%, respectively, improving upon existing methods by 5-6% in both metrics.

Conclusions: Through multi-view contrastive learning, MCLCBA provides an approach for predicting RM sites in sample-limited scenarios.
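
The Chaos Game Representation named in the Results can be illustrated with a minimal sketch. The corner assignment, the starting point, and the helper names `cgr_points` and `fcgr` are assumptions made for illustration; the abstract does not say how MCLCBA converts CGR output into features, so this shows the general technique only.

```python
# Chaos Game Representation (CGR) of a nucleotide sequence, standard library only.
# Corner layout is one common convention, assumed here for illustration.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "U": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory: each point is the midpoint between the
    previous point and the corner of the current base."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper().replace("T", "U"):
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def fcgr(seq, k=2):
    """Frequency CGR: count trajectory points in a 2^k x 2^k grid.
    Each cell corresponds to one k-mer, so the grid is a k-mer count image."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i, (x, y) in enumerate(cgr_points(seq)):
        if i + 1 < k:
            continue  # the first k-1 points do not yet encode a full k-mer
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for row in fcgr("GGACUGGACUAACU", k=2):
        print(row)
```

The resulting grid can be treated as a small image, which is why CGR pairs naturally with convolutional encoders.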

BMC Bioinformatics 26(1):281. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12628535/pdf/
Citations: 0
Denoising single-cell RNA-seq data with a deep learning-embedded statistical framework.
IF 3.3, Zone 3 (Biology), Q2 Biochemical Research Methods. Pub Date: 2025-11-19. DOI: 10.1186/s12859-025-06296-w
Qinhuan Luo, Yongzhen Yu, Tianying Wang

Background: Single-cell RNA sequencing (scRNA-seq) provides extensive opportunities to explore cellular heterogeneity but is often limited by substantial technical noise and variability. The prevalence of zero counts, arising from both biological variation and technical dropout events, poses significant challenges for downstream analyses. Existing imputation methods face inherent trade-offs: statistical approaches maintain interpretability but exhibit limited capacity for capturing complex, non-linear gene expression relationships, whereas deep learning methods demonstrate superior flexibility but are prone to overfitting and lack mechanistic interpretability, particularly in settings with limited sample sizes.

Methods: We present ZILLNB (Zero-Inflated Latent factors Learning-based Negative Binomial), a novel computational framework that integrates zero-inflated negative binomial (ZINB) regression with deep generative modeling. ZILLNB employs an ensemble architecture combining Information Variational Autoencoder (InfoVAE) and Generative Adversarial Network (GAN) to learn latent representations at cellular and gene levels. These latent factors serve as dynamic covariates within a ZINB regression framework, with parameters iteratively optimized through an Expectation-Maximization algorithm. This approach enables systematic decomposition of technical variability from intrinsic biological heterogeneity.
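
To make the zero-inflated negative binomial component concrete, here is a minimal sketch of the ZINB probability mass function and of the posterior probability that an observed zero is a technical dropout, which is the kind of quantity an Expectation-Maximization E-step computes. The mean/dispersion parameterisation (`mu`, `theta`), the zero-inflation weight `pi`, and all function names are assumptions for illustration; this is not ZILLNB's implementation, which additionally ties these parameters to learned latent factors.

```python
import math


def nb_log_pmf(k, mu, theta):
    """Log-probability of count k under a negative binomial with mean mu > 0
    and dispersion theta > 0 (variance = mu + mu**2 / theta)."""
    return (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )


def zinb_pmf(k, mu, theta, pi):
    """ZINB probability: with probability pi the count is a structural zero,
    otherwise it is drawn from the negative binomial."""
    nb = math.exp(nb_log_pmf(k, mu, theta))
    if k == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb


def dropout_posterior(mu, theta, pi):
    """Posterior probability that an observed zero came from the
    zero-inflation component rather than from the negative binomial."""
    nb_zero = math.exp(nb_log_pmf(0, mu, theta))
    return pi / (pi + (1.0 - pi) * nb_zero)


if __name__ == "__main__":
    # A highly expressed gene: an observed zero is most likely a dropout.
    print(dropout_posterior(mu=20.0, theta=2.0, pi=0.3))
    # A lowly expressed gene: a zero is plausibly a true biological zero.
    print(dropout_posterior(mu=0.1, theta=2.0, pi=0.3))
```

Separating the two sources of zeros in this way is what allows a model to impute dropouts while leaving genuine non-expression untouched.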

Results: Comparative evaluations across multiple scRNA-seq datasets demonstrate ZILLNB's superior performance. In cell type classification tasks using mouse cortex and human PBMC datasets, ZILLNB achieved the highest Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI) among the tested methods, with improvements ranging from 0.05 to 0.2 over VIPER, scImpute, DCA, DeepImpute, SAVER, scMultiGAN and ALRA. For differential expression analysis validated against matched bulk RNA-seq data, ZILLNB improved the area under the Receiver Operating Characteristic curve (AUC-ROC) and under the Precision-Recall curve (AUC-PR) by 0.05 to 0.3 compared to standard and other imputation methods, with consistently lower false discovery rates. Application to idiopathic pulmonary fibrosis (IPF) datasets revealed distinct fibroblast subpopulations undergoing fibroblast-to-myofibroblast transition, validated through marker gene expression and pathway enrichment analyses.

Conclusion: ZILLNB provides a principled framework for addressing technical artifacts in scRNA-seq data while preserving biological variation. Integrating statistical modeling with deep learning yields robust performance across common single-cell analysis tasks, including cell type identification, differential expression analysis, and rare cell population discovery.
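
The Adjusted Rand Index reported in the Results compares a predicted clustering with reference cell type labels while correcting for chance agreement. The sketch below implements the standard pair-counting formula in plain Python; the function name and the example labels are illustrative and not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same cells.
    1.0 means identical partitions; values near 0 mean chance-level agreement."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one cell: partitions are trivially identical

    # Pairs of cells that share a cluster in both, in truth, and in prediction.
    joint = Counter(zip(labels_true, labels_pred))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # both partitions are all-in-one or all-singletons
    return (sum_joint - expected) / (maximum - expected)


if __name__ == "__main__":
    truth = ["T cell", "T cell", "B cell", "B cell"]
    clusters = [0, 0, 1, 2]  # the B cells were split into two clusters
    print(adjusted_rand_index(truth, clusters))
```

Because the index depends only on which cells are grouped together, cluster identifiers do not need to match the reference label names.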

BMC Bioinformatics 26(1):282. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12629073/pdf/
Citations: 0
AllergenAI: a deep learning model predicting allergenicity based on protein sequence.
IF 3.3, Zone 3 (Biology), Q2 Biochemical Research Methods. Pub Date: 2025-11-18. DOI: 10.1186/s12859-025-06302-1
Jiajia Liu, Surendra S Negi, Chengyuan Yang, Xiaobo Zhou, Catherine H Schein, Werner Braun, Pora Kim
BMC Bioinformatics 26(1):279. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625376/pdf/
Citations: 0
scMFF: a machine learning framework with multiple feature fusion strategies for cell type identification.
IF 3.3, Zone 3 (Biology), Q2 Biochemical Research Methods. Pub Date: 2025-11-18. DOI: 10.1186/s12859-025-06309-8
Nan Sun, Yu Wang, Xiang Shi, Dengcheng Yang, Rongling Wu, Stephen S-T Yau

Accurate cell type classification is critical for downstream analysis in single-cell RNA sequencing (scRNA-seq). Most existing methods rely on a single type of feature representation, such as statistical, information-theoretic, matrix factorization, or deep learning-based features. However, each type captures different aspects of the data, and no single feature type can fully represent the complex differences between cell types. Moreover, naïvely concatenating multiple features may introduce redundancy or noise, reducing model performance. To address these challenges, we propose scMFF, a multiple feature fusion framework that integrates four features and explores six fusion strategies in combination with various classifiers for single-cell type classification. Comprehensive evaluations on 42 disease-related datasets and an external COVID-19 dataset demonstrate that scMFF outperforms single-feature approaches in terms of performance and stability, providing a reliable and effective solution for scRNA-seq data analysis.

BMC Bioinformatics 26(1):277. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625116/pdf/
Citations: 0
Talk2Biomodels: AI agent-based open-source LLM initiative for kinetic biological models.
IF 3.3, Zone 3 (Biology), Q2 Biochemical Research Methods. Pub Date: 2025-11-18. DOI: 10.1186/s12859-025-06310-1
Lilija Wehling, Gurdeep Singh, Ahmad Wisnu Mulyadi, Rakesh Hadne Sreenath, Henning Hermjakob, Tung V N Nguyen, Thomas Rückle, Mohammed H Mosa, Henrik Cordes, Tommaso Andreani, Thomas Klabunde, Rahuman S Malik Sheriff, Douglas McCloskey

Background: Quantitative kinetic models of biological regulatory processes play an important role in understanding disease mechanisms. However, their simulation and analysis require specialized domain expertise.

Results: In this study, we present Talk2Biomodels (T2B), an open-source, user-friendly, large language model-based agentic AI platform designed to facilitate access to computational models of biological systems and to promote the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles in systems biology. T2B allows users to interact with and analyse mathematical models of biological systems through natural-language conversations, thereby lowering the barrier to entry for model interpretation and hypothesis-driven exploration. The platform natively supports models encoded in the Systems Biology Markup Language (SBML), a widely adopted standard in the computational biology community. T2B is integrated with the BioModels database (https://www.ebi.ac.uk/biomodels/), enabling retrieval, simulation, and analysis of curated systems biology models. We illustrate the platform's capabilities through use cases in precision medicine, infectious disease epidemiology, and the study of emergent network-level properties in cellular systems, demonstrating how both computational experts and domain scientists without formal modelling training can derive actionable insights from complex biological models. Talk2Biomodels is available at https://github.com/VirtualPatientEngine/AIAgents4Pharma. Detailed documentation and use cases are available at https://virtualpatientengine.github.io/AIAgents4Pharma/talk2biomodels/intro/.

Conclusions: In summary, T2B lowers the barrier for non-experts to engage with and extract insights from computational models of biological systems, provides experts with a streamlined interface for analysing models, and contributes to the FAIRification of models.

BMC Bioinformatics 26(1):276. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625589/pdf/
Citations: 0
Machine learning for genomic prediction of growth traits in aquaculture: a case study of the Australasian snapper (Chrysophrys auratus).
IF 3.3, Zone 3 (Biology), Q2 Biochemical Research Methods. Pub Date: 2025-11-18. DOI: 10.1186/s12859-025-06287-x
Ze Chen, Julie Blommaert, Yi Mei, Linley Jesson, Maren Wellenreuther, Mengjie Zhang

Background: Chrysophrys auratus (family: Sparidae), commonly known as Australasian snapper, is a warm-water species being developed as a candidate for aquaculture in New Zealand. Genomic selection of elite snapper offers significant potential to accelerate genetic gains in aquaculture; however, the complexity of genetic architecture, coupled with challenges such as missing data and high dimensionality, poses significant hurdles. Machine learning techniques have emerged as powerful tools in genomic selection programmes due to their flexibility and ability to model complex, polygenic and non-linear relationships between genotypes and traits. This study aims to develop a comprehensive machine learning framework to evaluate imputation methods and genomic prediction models, and identify single-nucleotide polymorphisms associated with growth traits in snapper, ultimately contributing to the advancement of selective breeding programmes.

Results: We evaluated multiple approaches for each component of the machine learning framework. We developed the Domain Knowledge-based K-nearest neighbour (DK-KNN) imputation method, which achieved an imputation accuracy of 98.33% in simulation testing and outperformed two alternative imputation methods. Among the feature selection and classification combinations evaluated for growth prediction, Chi-squared feature selection paired with Distance-Weighted Discrimination (Chi2-DWD) achieved 60% prediction accuracy, comparable to genomic best linear unbiased prediction (60.3%) but without requiring the genomic relationship matrix. Notably, adding Domain Knowledge-based Pre-filtering (DK Pre-filtering) as a first stage did not substantially change prediction accuracy while reducing the dimensionality of the feature space.

Conclusions: Integrating domain knowledge into machine learning frameworks effectively addresses missing values and high dimensionality in snapper genomic data. The evaluation indicates that Chi2-DWD is a promising combination for genomic prediction tasks. DK Pre-filtering removes redundant features without affecting model performance. The selected features were biologically meaningful and were confirmed by biological analysis to be associated with growth traits, providing valuable insights for selective breeding programmes.
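
Chi-squared feature selection, as used in the Chi2-DWD combination, can be sketched as ranking each SNP by the Pearson chi-squared statistic of its genotype-by-class contingency table. The 0/1/2 genotype coding, the binary growth classes, and the helper names are assumptions for illustration; the abstract does not specify the exact statistic or thresholds used, and the DWD classifier is not shown.

```python
from collections import Counter


def chi2_statistic(feature, labels):
    """Pearson chi-squared statistic for the contingency table of one
    categorical feature (e.g. SNP genotype 0/1/2) against class labels."""
    n = len(feature)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    stat = 0.0
    for f_value, f_count in feature_totals.items():
        for l_value, l_count in label_totals.items():
            expected = f_count * l_count / n  # count expected under independence
            stat += (observed[(f_value, l_value)] - expected) ** 2 / expected
    return stat


def select_top_k(genotypes, labels, k):
    """Score every SNP column and return the indices of the k highest-scoring
    SNPs together with all scores. `genotypes` is a list of samples, each a
    list of genotype codes."""
    n_snps = len(genotypes[0])
    scores = [
        chi2_statistic([sample[j] for sample in genotypes], labels)
        for j in range(n_snps)
    ]
    ranked = sorted(range(n_snps), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


if __name__ == "__main__":
    # Toy data: SNP 0 tracks the growth class, SNP 1 is unrelated, SNP 2 is constant.
    genotypes = [[0, 0, 1], [0, 1, 1], [0, 2, 1], [2, 0, 1], [2, 1, 1], [2, 2, 1]]
    growth_class = [0, 0, 0, 1, 1, 1]
    print(select_top_k(genotypes, growth_class, k=1))
```

A filter of this kind scores each SNP independently, so it scales linearly with the number of markers and needs no genomic relationship matrix.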

BMC Bioinformatics 26(1):278. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625465/pdf/
引用次数: 0
SeqForge: a scalable platform for alignment-based searches, motif detection, and sequence curation across meta/genomic datasets.
IF 3.3 | Zone 3, Biology | Q2, Biochemical Research Methods | Pub Date: 2025-11-18 | DOI: 10.1186/s12859-025-06297-9
Elijah R Bring Horvath, Jaclyn M Winter

Background: The rapid increase in publicly available microbial and metagenomic data has created a growing demand for tools that can efficiently perform custom large-scale comparative searches and functional annotation. While BLAST+ remains the standard for sequence similarity searches, population-level studies often require custom scripting and manual curation of results, which can present barriers for many researchers.

Results: We developed SeqForge, a scalable, modular command-line toolkit that streamlines alignment-based searches and motif mining across large genomic datasets. SeqForge automates BLAST+ database creation and querying, integrates amino acid motif discovery, enables sequence and contig extraction, and curates results into structured, easily parsed formats. The platform supports diverse input formats, parallelized execution for high-performance computing environments, and built-in visualization tools. Benchmarking demonstrates that SeqForge achieves near-linear runtime scaling for computationally intensive modules while maintaining modest memory usage.
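
To make the idea of amino acid motif scanning concrete, here is a minimal Python sketch. It is not SeqForge's implementation or command-line interface, which the abstract does not describe; the function names (`parse_fasta`, `motif_to_regex`, `find_motifs`), the convention that `X` stands for any residue, and the demo sequences are all assumptions made for illustration.

```python
import re
from typing import Dict, Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            # Use the first whitespace-delimited token as the record ID.
            header, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def motif_to_regex(motif: str) -> re.Pattern:
    """Translate a simple motif (hypothetical syntax: 'X' = any residue) into a regex."""
    body = "".join("." if ch.upper() == "X" else re.escape(ch.upper()) for ch in motif)
    # A lookahead with a capture group reports overlapping matches as well.
    return re.compile(f"(?=({body}))")


def find_motifs(fasta_text: str, motif: str) -> Dict[str, List[Tuple[int, str]]]:
    """Map each record ID to its (1-based position, matched text) hits."""
    pattern = motif_to_regex(motif)
    hits: Dict[str, List[Tuple[int, str]]] = {}
    for rec_id, seq in parse_fasta(fasta_text):
        found = [(m.start() + 1, m.group(1)) for m in pattern.finditer(seq.upper())]
        if found:
            hits[rec_id] = found
    return hits


# Toy input, invented for this example.
DEMO = ">protA demo record\nMAGSGKSGKTT\n>protB\nMKLV\n"
print(find_motifs(DEMO, "GXXXXGK"))  # {'protA': [(3, 'GSGKSGK')]}
```

Records without a hit are omitted from the result, which keeps the output easy to parse when scanning many sequences.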

Conclusions: SeqForge lowers the computational barrier for large-scale meta/genomic exploration, enabling researchers to perform population-scale BLAST searches, motif detection, and sequence curation without custom scripting. The toolkit is freely available and platform-independent, making it suitable for both personal workstations and high-performance computing environments.

Graph convolution network based on meta-paths and mutual information for drug-target interaction prediction.
IF 3.3 | Zone 3, Biology | Q2, Biochemical Research Methods | Pub Date: 2025-11-07 | DOI: 10.1186/s12859-025-06295-x
Shujuan Cao, Binying Cai, Zhejian Qiu, Tiantian Chang, Qiqige Wuyun, Fang-Xiang Wu

Background: Predicting drug-target interactions (DTIs) plays a pivotal role in accelerating drug repositioning by prioritizing candidate drugs and reducing experimental costs. Despite advancements in deep learning, several challenges still require further exploration, including sparsity and inadequate representation of feature relationships.

Results: We propose GCNMM, a novel graph convolutional network based on meta-paths and mutual information, to predict latent DTIs in drug-target heterogeneous networks. Our approach begins by constructing a fused DTI network based on meta-paths and a graph attention network. We compute multiple similarity networks using Jaccard coefficients and integrate them into the fused drug and target similarity networks through entropy-based fusion. These networks are then jointly processed by a graph convolutional auto-encoder to generate low-dimensional feature representations. To preserve the topological structure of the original network in the embedding space and strengthen the relationship between the input and latent representations, we incorporate spatial topological consistency and mutual information maximization as dual optimization objectives.
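
As an illustration of the Jaccard step only, the following Python sketch builds a pairwise similarity network from set-valued profiles. The abstract does not say which features the coefficients are computed over, so using each drug's set of known targets is an assumption, as are the function names and the toy data. The entropy-based fusion and the graph convolutional auto-encoder are not sketched here because the abstract gives no details about them.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; returns 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_network(profiles: Dict[str, Set[str]]) -> Dict[str, Dict[str, float]]:
    """Return a dense, symmetric similarity matrix as a nested dict."""
    names = sorted(profiles)
    return {u: {v: jaccard(profiles[u], profiles[v]) for v in names} for u in names}


# Hypothetical drug -> known-target profiles, invented for this example.
drug_targets = {
    "drugA": {"T1", "T2", "T3"},
    "drugB": {"T2", "T3", "T4"},
    "drugC": {"T5"},
}
network = similarity_network(drug_targets)
print(network["drugA"]["drugB"])  # 0.5 (2 shared targets out of 4 distinct)
```

The same function applies unchanged to target-side profiles, for example each target's set of interacting drugs, which would give the target similarity network.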

Conclusions: The experimental results illustrate that GCNMM exhibits superior performance to existing baseline models in DTI prediction. Furthermore, case studies validate the practical effectiveness of GCNMM, highlighting its potential in DTI prediction and drug repositioning.

TransST: transfer learning embedded spatial factor modeling of spatial transcriptomics data.
IF 3.3 | Zone 3, Biology | Q2, Biochemical Research Methods | Pub Date: 2025-11-06 | DOI: 10.1186/s12859-025-06099-z
Shuo Shuo Liu, Shikun Wang, Yuxuan Chen, Anil K Rustgi, Ming Yuan, Jianhua Hu

Background: Spatial transcriptomics has emerged as a powerful tool in biomedical research because of its ability to capture both the spatial context and the abundance of the complete RNA transcript profile in organs of interest. However, limitations of the technology, such as relatively low resolution and insufficient sequencing depth, make it difficult to reliably extract real biological signals from these data. To alleviate this challenge, we propose a novel transfer learning framework, referred to as TransST, that adaptively leverages cell-labeled information from external sources to infer cell-level heterogeneity in a target spatial transcriptomics dataset.

Results: Applications in several real studies, as well as a number of simulation settings, show that our approach significantly improves upon existing techniques. For example, in the breast cancer study, TransST successfully identifies five biologically meaningful cell clusters, including the two subgroups of cancer in situ and invasive cancer; in addition, among all the studied methods, only TransST is able to separate the adipose tissues from the connective tissues.

Conclusions: In summary, the proposed method, TransST, is both effective and robust in identifying cell subclusters and detecting the corresponding driving biomarkers in spatial transcriptomics data.

A lightweight single-view contrastive learning hypergraph neural network for food-microbe-disease association prediction.
IF 3.3 | Zone 3, Biology | Q2, Biochemical Research Methods | Pub Date: 2025-11-04 | DOI: 10.1186/s12859-025-06283-1
Jianqiang Hu, Mingyi Hu, Yangxiang Wu, Songyao Mu, Dahao Huang, Baolong Wang, Yuchen Gao, Shixin Gu, Jinlin Zhu

Background: Identifying potential associations among food, gut microbiota, and disease is fundamental for elucidating interaction mechanisms and advancing personalized healthy dietary strategies. While computational methods have been extensively applied to predict microbiota-disease associations, methods for predicting food-microbiota relationships remain limited, particularly for higher-order food-microbiota-disease interactions.

Results: In this work, we construct a food-microbe-disease (FMD) database encompassing 190 food items, 219 gut microbiota species, and 163 disease entities, resulting in 17,065 FMD associations. We then propose a lightweight single-view contrastive learning hypergraph neural network (LSCHNN) for FMD association prediction on the sparse FMD dataset. LSCHNN formulates ternary FMD interactions as a hypergraph, in which foods, microbes, and diseases are represented by nodes and FMD triplets are represented by hyperedges, and leverages the biological features of foods, microbes, and diseases as node attributes. Subsequently, a hypergraph neural network is designed to learn the embeddings of foods, microbes, and diseases from the hypergraph and predict potential ternary FMD associations. Additionally, we incorporate a single-view contrastive learning mechanism that enhances the model's ability to extract discriminative features and improves generalization on sparse data. Comprehensive comparison experiments demonstrate that LSCHNN outperforms other state-of-the-art methods in both the precision of ternary FMD association prediction and the discovery of additional potential FMD associations. Case studies on two microbes further confirm the effectiveness of LSCHNN in identifying potential FMD associations.
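
The hypergraph formulation described above can be sketched as a node-by-hyperedge incidence matrix, with one column per FMD triplet. This is a minimal Python illustration, not the authors' code: the `build_incidence` helper, the typed node keys, and the placeholder entity names are assumptions, and the neural network and contrastive learning components are not shown.

```python
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]   # (food, microbe, disease)
NodeKey = Tuple[str, str]        # (node type, name), so equal names of different types stay distinct


def triplet_nodes(triplet: Triplet) -> List[NodeKey]:
    """Return the three typed nodes that one FMD hyperedge connects."""
    food, microbe, disease = triplet
    return [("food", food), ("microbe", microbe), ("disease", disease)]


def build_incidence(triplets: List[Triplet]) -> Tuple[Dict[NodeKey, int], List[List[int]]]:
    """Return (node_index, H), where H[i][j] == 1 if node i belongs to hyperedge j."""
    node_index: Dict[NodeKey, int] = {}
    for triplet in triplets:
        for key in triplet_nodes(triplet):
            # Assign the next free row index the first time a node is seen.
            node_index.setdefault(key, len(node_index))
    H = [[0] * len(triplets) for _ in node_index]
    for j, triplet in enumerate(triplets):
        for key in triplet_nodes(triplet):
            H[node_index[key]][j] = 1
    return node_index, H


# Placeholder triplets, invented for this example (not taken from the FMD database).
triplets = [
    ("food_1", "microbe_1", "disease_1"),
    ("food_1", "microbe_2", "disease_1"),
]
node_index, H = build_incidence(triplets)
node_degree = {key: sum(H[i]) for key, i in node_index.items()}
print(node_degree[("food", "food_1")])  # 2: food_1 appears in both hyperedges
```

Each column of `H` sums to 3 because every hyperedge joins exactly one food, one microbe, and one disease; row sums give node degrees, which hypergraph convolution schemes commonly use for normalization.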

Conclusions: A novel computational model, LSCHNN, is proposed, marking the first integration of hypergraph neural networks with lightweight single-view contrastive learning for ternary FMD association prediction, providing a groundbreaking framework for precision nutrition and personalized dietary interventions.
