Pub Date: 2025-11-19, DOI: 10.1186/s12859-025-06306-x
Honglei Wang, Xuesong Zhang, Yanjing Sun, Zhaoyang Liu, Lin Zhang
Background: RNA methylation (RM) regulates gene expression, RNA stability, and protein translation. Accurate prediction of RM modification sites is essential for understanding their biological functions, but wet-lab detection techniques are operationally complex and costly. Deep learning (DL) methods have been applied to this task, yet their performance degrades as the training set shrinks. For instance, the Bidirectional Gated Recurrent Unit (BGRU) shows substantial performance degradation. The Convolutional Neural Network (CNN) can extract local pattern features but learns overly specific patterns from sample-limited data, resulting in poor feature generalization. Bidirectional Long Short-Term Memory (BiLSTM) excels at modeling long-range dependencies but cannot learn its gating parameters well enough to capture effective sequence representations from limited samples. The Transformer processes sequences in parallel and captures global dependencies through self-attention, but its quadratic computational complexity and large parameter count make it prone to overfitting on small datasets.
Results: This study proposes a Multi-view Contrastive Learning with CNN-BiLSTM-Attention (MCLCBA) framework for RM modification site prediction. The multi-view approach comprises a primary view and an auxiliary view: the primary view uses DNA Bidirectional Encoder Representations from Transformers (DNABERT) to extract sequence contextual features, and the auxiliary view uses Chaos Game Representation (CGR) to extract structural features. Feature extraction includes four components: data augmentation, multi-view encoders, projection heads, and contrastive loss functions. Using dual differential data augmentation strategies and multi-view network architectures for feature processing and fusion, the model maximizes positive sample similarity while minimizing negative sample similarity, and thereby learns discriminative feature representations that are invariant to data augmentation. This addresses feature learning in sample-limited scenarios. Experimental results on the sample-limited m7G dataset show that MCLCBA achieves an AUROC of 85.64% and an AUPRC of 86.94%, improving upon existing methods by 5-6% in both metrics.
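To make the auxiliary view more concrete, the following is a minimal sketch of a generic Chaos Game Representation, not the MCLCBA implementation. The corner assignment (A, C, G, U on the unit square), the starting point (0.5, 0.5), and the function names `cgr_points` and `fcgr` are assumptions chosen for illustration; the abstract does not state which CGR variant the authors use.

```python
# Generic Chaos Game Representation (CGR) of an RNA sequence.
# Assumption: one common corner layout; the paper's layout may differ.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "U": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory: each point is the midpoint between the
    previous point and the corner of the current nucleotide."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper().replace("T", "U"):
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def fcgr(seq, k=2):
    """Frequency CGR: count trajectory points in a 2^k x 2^k grid,
    giving an image-like matrix that a CNN could consume."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid
```

Calling `fcgr(sequence, k=3)` on a sequence window would yield an 8 x 8 count matrix; the grid resolution `k` is a free parameter of this sketch.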
Conclusions: Through multi-view contrastive learning, MCLCBA provides an approach for predicting RM modification sites in sample-limited scenarios.
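The contrastive objective described in the Results (pull positive pairs together, push negative pairs apart) is commonly written as an InfoNCE-style loss. The sketch below is one such formulation under stated assumptions: the positive pair for sample i is the primary-view embedding and the auxiliary-view embedding of the same sample, every other sample in the batch is a negative, and the temperature value is arbitrary. The abstract does not specify the exact loss, so this should not be read as the authors' definition.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(primary, auxiliary, temperature=0.5):
    """Mean InfoNCE loss over a batch.

    primary[i] and auxiliary[i] are embeddings of the same sample from the
    two views (the positive pair); auxiliary[j], j != i, act as negatives.
    """
    n = len(primary)
    total = 0.0
    for i in range(n):
        logits = [cosine(primary[i], auxiliary[j]) / temperature for j in range(n)]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n
```

The loss is small when each sample's two views are more similar to each other than to the views of other samples, which is the invariance property the abstract describes.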
MCLCBA: multi-view contrastive learning network for RNA methylation site prediction. BMC Bioinformatics, 26(1), 281. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12628535/pdf/
Pub Date: 2025-11-19, DOI: 10.1186/s12859-025-06296-w
Qinhuan Luo, Yongzhen Yu, Tianying Wang
Background: Single-cell RNA sequencing (scRNA-seq) provides extensive opportunities to explore cellular heterogeneity but is often limited by substantial technical noise and variability. The prevalence of zero counts, arising from both biological variation and technical dropout events, poses significant challenges for downstream analyses. Existing imputation methods face inherent trade-offs: statistical approaches maintain interpretability but exhibit limited capacity for capturing complex, non-linear gene expression relationships, whereas deep learning methods demonstrate superior flexibility but are prone to overfitting and lack mechanistic interpretability, particularly in settings with limited sample sizes.
Methods: We present ZILLNB (Zero-Inflated Latent factors Learning-based Negative Binomial), a novel computational framework that integrates zero-inflated negative binomial (ZINB) regression with deep generative modeling. ZILLNB employs an ensemble architecture combining Information Variational Autoencoder (InfoVAE) and Generative Adversarial Network (GAN) to learn latent representations at cellular and gene levels. These latent factors serve as dynamic covariates within a ZINB regression framework, with parameters iteratively optimized through an Expectation-Maximization algorithm. This approach enables systematic decomposition of technical variability from intrinsic biological heterogeneity.
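For readers unfamiliar with the zero-inflated negative binomial (ZINB) model, here is a minimal sketch of its log-probability and of the standard E-step quantity for an observed zero. The mean/dispersion parameterisation (`mu`, `theta`), the dropout probability `pi`, and the function names are assumptions for illustration; ZILLNB's latent-factor covariates and its actual update equations are not given in the abstract and are not reproduced here.

```python
import math


def nb_log_pmf(y, mu, theta):
    """Log-pmf of a negative binomial with mean mu > 0 and dispersion theta > 0."""
    return (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(y, mu, theta, pi):
    """Log-pmf of a ZINB: with probability pi the count is a structural zero
    (dropout), otherwise it is drawn from NB(mu, theta). Requires 0 <= pi < 1."""
    nb = nb_log_pmf(y, mu, theta)
    if y == 0:
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb


def dropout_posterior(mu, theta, pi):
    """Textbook E-step for an observed zero: posterior probability that the
    zero is a dropout rather than a genuine NB zero."""
    nb_zero = math.exp(nb_log_pmf(0, mu, theta))
    return pi / (pi + (1.0 - pi) * nb_zero)
```

In an Expectation-Maximization loop of this kind, the posterior from `dropout_posterior` would weight each zero when the parameters are re-estimated in the M-step.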
Results: Comparative evaluations across multiple scRNA-seq datasets demonstrate ZILLNB's superior performance. In cell type classification tasks using mouse cortex and human PBMC datasets, ZILLNB achieved the highest Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI) among tested methods, with improvements ranging from 0.05 to 0.2 over VIPER, scImpute, DCA, DeepImpute, SAVER, scMultiGAN and ALRA. For differential expression analysis validated against matched bulk RNA-seq data, ZILLNB improved the area under the Receiver Operating Characteristic curve (AUC-ROC) and under the Precision-Recall curve (AUC-PR) by 0.05 to 0.3 compared to standard and other imputation methods, with consistently lower false discovery rates. Application to idiopathic pulmonary fibrosis (IPF) datasets revealed distinct fibroblast subpopulations undergoing fibroblast-to-myofibroblast transition, validated through marker gene expression and pathway enrichment analyses.
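The Adjusted Rand Index used above compares a clustering with reference cell type labels and corrects for chance agreement. The sketch below implements the standard pair-counting formula in plain Python; the function name is ours, and it assumes at least two cells. It is included only to show what a difference of 0.05 to 0.2 is measured on (a scale where 1 is perfect agreement and values near 0 indicate chance-level agreement).

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same cells (n >= 2)."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    row_totals = Counter(labels_true)
    col_totals = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in row_totals.values())
    sum_cols = sum(comb(c, 2) for c in col_totals.values())

    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)
```

Because the index is invariant to label names, a clustering that recovers the same groups under different cluster identifiers still scores 1.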
Conclusion: ZILLNB provides a principled framework for addressing technical artifacts in scRNA-seq data while preserving biological variation. Integrating statistical modeling with deep learning enables robust performance across common single-cell analysis tasks, including cell type identification, differential expression analysis, and rare cell population discovery.
Denoising single-cell RNA-seq data with a deep learning-embedded statistical framework. BMC Bioinformatics, 26(1), 282. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12629073/pdf/
Pub Date: 2025-11-18, DOI: 10.1186/s12859-025-06302-1
Jiajia Liu, Surendra S Negi, Chengyuan Yang, Xiaobo Zhou, Catherine H Schein, Werner Braun, Pora Kim
AllergenAI: a deep learning model predicting allergenicity based on protein sequence. BMC Bioinformatics, 26(1), 279. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625376/pdf/
Pub Date: 2025-11-18, DOI: 10.1186/s12859-025-06309-8
Nan Sun, Yu Wang, Xiang Shi, Dengcheng Yang, Rongling Wu, Stephen S-T Yau
Accurate cell type classification is critical for downstream analysis in single-cell RNA sequencing (scRNA-seq). Most existing methods rely on a single type of feature representation, such as statistical, information-theoretic, matrix factorization, or deep learning-based features. However, each captures different aspects of the data, and no single feature type can fully represent the complex differences between cell types. Moreover, naïvely concatenating multiple features may introduce redundancy or noise, reducing model performance. To address these challenges, we propose scMFF, a multiple feature fusion framework that integrates four feature types and explores six fusion strategies in combination with various classifiers for single-cell type classification. Comprehensive evaluations on 42 disease-related datasets and an external COVID-19 dataset demonstrate that scMFF outperforms single-feature approaches in both performance and stability, providing a reliable and effective solution for scRNA-seq data analysis.
scMFF: a machine learning framework with multiple feature fusion strategies for cell type identification. BMC Bioinformatics, 26(1), 277. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625116/pdf/
Pub Date: 2025-11-18, DOI: 10.1186/s12859-025-06310-1
Lilija Wehling, Gurdeep Singh, Ahmad Wisnu Mulyadi, Rakesh Hadne Sreenath, Henning Hermjakob, Tung V N Nguyen, Thomas Rückle, Mohammed H Mosa, Henrik Cordes, Tommaso Andreani, Thomas Klabunde, Rahuman S Malik Sheriff, Douglas McCloskey
Background: Quantitative kinetic models of biological regulatory processes play an important role in understanding disease mechanisms. However, their simulation and analysis require specialized domain expertise.
Results: In this study, we present Talk2Biomodels (T2B), an open-source, user-friendly agentic AI platform based on large language models, designed to facilitate access to computational models of biological systems and to promote the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles in systems biology. T2B allows users to interact with and analyse mathematical models of biological systems through natural-language conversation, thereby lowering the barrier to entry for model interpretation and hypothesis-driven exploration. The platform natively supports models encoded in the Systems Biology Markup Language (SBML), a widely adopted standard in the computational biology community. T2B is integrated with the BioModels database (https://www.ebi.ac.uk/biomodels/), enabling retrieval, simulation, and analysis of curated systems biology models. We illustrate the platform's capabilities through use cases in precision medicine, infectious disease epidemiology, and the study of emergent network-level properties in cellular systems, demonstrating how both computational experts and domain scientists without formal modelling training can derive actionable insights from complex biological models. Talk2Biomodels is available at https://github.com/VirtualPatientEngine/AIAgents4Pharma. Detailed documentation and use cases are available at https://virtualpatientengine.github.io/AIAgents4Pharma/talk2biomodels/intro/.
Conclusions: In summary, T2B lowers the barrier for non-experts to engage with and extract insights from computational models of biological systems, provides experts with a streamlined interface for analysing models, and contributes to the FAIRification of models.
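As a reminder of what simulating a quantitative kinetic model involves at its simplest, the sketch below integrates a single first-order reaction A -> B with forward Euler and compares it with the analytic solution. It is purely illustrative: it does not use T2B, SBML, or the BioModels database, and the function names, rate constant, and step size are assumptions. Curated models are typically far larger and are solved with dedicated simulators.

```python
import math


def simulate_first_order(a0, k, t_end, dt=0.001):
    """Forward-Euler integration of dA/dt = -k*A and dB/dt = +k*A.

    Returns the concentrations (A, B) at time t_end, starting from B = 0.
    """
    a, b = a0, 0.0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        flux = k * a  # mass-action rate of A -> B
        a -= flux * dt
        b += flux * dt
    return a, b


def analytic_a(a0, k, t):
    """Closed-form solution A(t) = A0 * exp(-k*t) for comparison."""
    return a0 * math.exp(-k * t)
```

Even this toy case requires choosing a rate law, a solver, and a step size, which is the kind of expertise the Background says such analyses demand.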
Talk2Biomodels: AI agent-based open-source LLM initiative for kinetic biological models. BMC Bioinformatics, 26(1), 276. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625589/pdf/
Pub Date: 2025-11-18, DOI: 10.1186/s12859-025-06287-x
Ze Chen, Julie Blommaert, Yi Mei, Linley Jesson, Maren Wellenreuther, Mengjie Zhang
Background: Chrysophrys auratus (family: Sparidae), commonly known as Australasian snapper, is a warm-water species being developed as a candidate for aquaculture in New Zealand. Genomic selection of elite snapper offers significant potential to accelerate genetic gains in aquaculture; however, the complexity of genetic architecture, coupled with challenges such as missing data and high dimensionality, poses significant hurdles. Machine learning techniques have emerged as powerful tools in genomic selection programmes due to their flexibility and ability to model complex, polygenic and non-linear relationships between genotypes and traits. This study aims to develop a comprehensive machine learning framework to evaluate imputation methods and genomic prediction models, and identify single-nucleotide polymorphisms associated with growth traits in snapper, ultimately contributing to the advancement of selective breeding programmes.
Results: We evaluated multiple approaches for each component of the machine learning framework. We developed and evaluated the Domain Knowledge-based K-nearest neighbour (DK-KNN) imputation method, which achieved an imputation accuracy of 98.33% in simulation testing and outperformed two alternative imputation methods. Among the feature selection and classification combinations evaluated for growth prediction, Chi-squared feature selection paired with Distance-Weighted Discrimination (Chi2-DWD) achieved 60% prediction accuracy, comparable to genomic best linear unbiased prediction (60.3%) but without requiring the genomic relationship matrix. Notably, the two-stage approach using Domain Knowledge-based Pre-filtering (DK Pre-filtering) did not substantially change prediction accuracy while reducing the dimensionality of the feature space.
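Chi-squared feature selection ranks each SNP by how strongly its genotype distribution depends on the class label. The sketch below computes the Pearson chi-squared statistic on a genotype-by-class contingency table and keeps the top-k SNPs. The genotype coding (0/1/2), the binary growth class, and the function names are assumptions for illustration; the paper's exact variant and thresholds are not stated in the abstract.

```python
from collections import Counter


def chi2_statistic(genotypes, classes):
    """Pearson chi-squared statistic of independence between one SNP's
    genotype codes and the class labels (same length, non-empty)."""
    n = len(genotypes)
    observed = Counter(zip(genotypes, classes))
    genotype_totals = Counter(genotypes)
    class_totals = Counter(classes)
    stat = 0.0
    for g, g_count in genotype_totals.items():
        for c, c_count in class_totals.items():
            expected = g_count * c_count / n
            stat += (observed[(g, c)] - expected) ** 2 / expected
    return stat


def top_k_snps(snp_matrix, classes, k):
    """Rank SNP columns by chi-squared score and return the k best indices.

    snp_matrix is a list of samples, each a list of genotype codes per SNP.
    """
    n_snps = len(snp_matrix[0])
    scores = [
        chi2_statistic([row[j] for row in snp_matrix], classes)
        for j in range(n_snps)
    ]
    return sorted(range(n_snps), key=lambda j: scores[j], reverse=True)[:k]
```

The selected column indices would then be passed to the downstream classifier (DWD in the study), which is outside the scope of this sketch.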
Conclusions: Integrating domain knowledge into machine learning frameworks effectively addresses missing values and high dimensionality in snapper genomic data. Chi2-DWD is a promising combination for genomic prediction tasks, and DK Pre-filtering removes redundant features without affecting model performance. Selected features showed biological significance and were confirmed to be associated with growth traits based on biological analysis, providing valuable insights for selective breeding programmes.
Machine learning for genomic prediction of growth traits in aquaculture: a case study of the Australasian snapper (Chrysophrys auratus). BMC Bioinformatics, 26(1), 278. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625465/pdf/
Pub Date: 2025-11-18, DOI: 10.1186/s12859-025-06297-9
Elijah R Bring Horvath, Jaclyn M Winter
Background: The rapid increase in publicly available microbial and metagenomic data has created a growing demand for tools that can efficiently perform custom large-scale comparative searches and functional annotation. While BLAST+ remains the standard for sequence similarity searches, population-level studies often require custom scripting and manual curation of results, which can present barriers for many researchers.
Results: We developed SeqForge, a scalable, modular command-line toolkit that streamlines alignment-based searches and motif mining across large genomic datasets. SeqForge automates BLAST+ database creation and querying, integrates amino acid motif discovery, enables sequence and contig extraction, and curates results into structured, easily parsed formats. The platform supports diverse input formats, parallelized execution for high-performance computing environments, and built-in visualization tools. Benchmarking demonstrates that SeqForge achieves near-linear runtime scaling for computationally intensive modules while maintaining modest memory usage.
Conclusions: SeqForge lowers the computational barrier for large-scale meta/genomic exploration, enabling researchers to perform population-scale BLAST searches, motif detection, and sequence curation without custom scripting. The toolkit is freely available and platform-independent, making it suitable for both personal workstations and high-performance computing environments.
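The "custom scripting" that such a toolkit replaces usually looks like the sketch below: assembling a BLAST+ command line and filtering its tabular output. This is not SeqForge's interface, which the abstract does not show. The helper names, file paths, and the identity threshold are hypothetical; the `blastn` flags and the 12-column layout of `-outfmt 6` are standard BLAST+ behaviour.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    qseqid: str
    sseqid: str
    pident: float
    length: int
    evalue: float
    bitscore: float


def blastn_command(query, db, out, threads=4):
    """Build (but do not run) a blastn call that writes tabular output."""
    return [
        "blastn", "-query", query, "-db", db,
        "-outfmt", "6", "-num_threads", str(threads), "-out", out,
    ]


def parse_outfmt6(lines, min_pident=90.0):
    """Parse default -outfmt 6 lines (qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore) and keep strong hits."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hit = Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]), float(f[11]))
        if hit.pident >= min_pident:
            hits.append(hit)
    return hits
```

The command list could be passed to `subprocess.run` once per genome; repeating that across thousands of genomes, and merging the parsed hits, is the bookkeeping a population-scale search requires.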
SeqForge: a scalable platform for alignment-based searches, motif detection, and sequence curation across meta/genomic datasets. BMC Bioinformatics, 26(1), 280. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625553/pdf/
Background: Predicting drug-target interactions (DTIs) plays a pivotal role in accelerating drug repositioning by prioritizing candidate drugs and reducing experimental costs. Despite advancements in deep learning, several challenges still require further exploration, including sparsity and inadequate representation of feature relationships.
Results: We propose GCNMM, a novel graph convolutional network based on meta-paths and mutual information, to predict latent DTIs in drug-target heterogeneous networks. Our approach begins by constructing a fused DTI network based on meta-paths and a graph attention network. We compute multiple similarity networks using Jaccard coefficients and integrate them into fused drug and target similarity networks through entropy-based fusion. These networks are then jointly processed by a graph convolutional auto-encoder to generate low-dimensional feature representations. To preserve the topological structure of the original network in the embedding space and strengthen the relationship between the input and latent representations, we incorporate spatial topological consistency and mutual information maximization as dual optimization objectives.
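The Jaccard step mentioned above can be illustrated with a minimal sketch. It assumes each drug is described by the set of targets it is known to interact with; the abstract does not say which profiles GCNMM actually compares, so the function names (`jaccard`, `similarity_network`) and the toy data are hypothetical and are not the authors' implementation.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b|; taken as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_network(profiles):
    """Return pairwise Jaccard similarities keyed by (name, name) pairs in sorted order."""
    names = sorted(profiles)
    return {
        (u, v): jaccard(profiles[u], profiles[v])
        for u, v in combinations(names, 2)
    }


# Hypothetical toy data: each drug maps to the set of targets it interacts with.
drug_targets = {
    "drugA": {"T1", "T2", "T3"},
    "drugB": {"T2", "T3"},
    "drugC": {"T4"},
}

sim = similarity_network(drug_targets)
for pair, score in sim.items():
    print(pair, round(score, 3))
```

The same function applies unchanged to targets described by their sets of interacting drugs, which would give a target similarity network. How several such networks are weighted in the entropy-based fusion is not specified in the abstract and is left out here.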
Conclusions: The experimental results show that GCNMM outperforms existing baseline models in DTI prediction. Furthermore, case studies validate the practical effectiveness of GCNMM, highlighting its potential for DTI prediction and drug repositioning.
Citation: Shujuan Cao, Binying Cai, Zhejian Qiu, Tiantian Chang, Qiqige Wuyun, Fang-Xiang Wu. Graph convolution network based on meta-paths and mutual information for drug-target interaction prediction. BMC Bioinformatics 26(1):275. Pub Date: 2025-11-07. DOI: 10.1186/s12859-025-06295-x. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12595897/pdf/
Pub Date: 2025-11-06 DOI: 10.1186/s12859-025-06099-z
Shuo Shuo Liu, Shikun Wang, Yuxuan Chen, Anil K Rustgi, Ming Yuan, Jianhua Hu
Background: Spatial transcriptomics has emerged as a powerful tool in biomedical research because of its ability to capture both the spatial context and the abundance of the complete RNA transcript profile in organs of interest. However, limitations of the technology, such as its relatively low resolution and insufficient sequencing depth, make it difficult to reliably extract real biological signals from these data. To alleviate this challenge, we propose a novel transfer learning framework, referred to as TransST, that adaptively leverages cell-labeled information from external sources to infer cell-level heterogeneity in a target spatial transcriptomics dataset.
Results: Applications to several real studies and a number of simulation settings show that our approach significantly improves on existing techniques. For example, in the breast cancer study, TransST successfully identifies five biologically meaningful cell clusters, including the two subgroups of cancer in situ and invasive cancer; in addition, among all the studied methods, only TransST is able to separate the adipose tissues from the connective tissues.
Conclusions: In summary, the proposed method TransST is both effective and robust in identifying cell subclusters and detecting corresponding driving biomarkers in spatial transcriptomics data.
Citation: TransST: transfer learning embedded spatial factor modeling of spatial transcriptomics data. BMC Bioinformatics 26(1):274. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12593783/pdf/
Background: Identifying potential associations among food, gut microbiota, and disease is fundamental for elucidating interaction mechanisms and advancing personalized healthy dietary strategies. While computational methods have been extensively applied to predict microbiota-disease associations, methods for predicting food-microbiota relationships remain limited, particularly regarding higher-order food-microbiota-disease interactions.
Results: In this work, we construct a food-microbe-disease (FMD) database encompassing 190 food items, 219 gut microbiota species, and 163 disease entities, resulting in 17,065 FMD associations. We then propose a lightweight single-view contrastive learning hypergraph neural network (LSCHNN) for FMD association prediction on the sparse FMD dataset. LSCHNN formulates ternary FMD interactions as a hypergraph, in which foods, microbes, and diseases are represented by nodes and FMD triplets are represented by hyperedges, and leverages the biological features of foods, microbes, and diseases as node attributes. Subsequently, a hypergraph neural network is designed to learn the embeddings of foods, microbes, and diseases from the hypergraph and predict potential ternary FMD associations. Additionally, we incorporate a single-view contrastive learning mechanism that enhances the model's ability to extract discriminative features and improves generalization on sparse data. Comprehensive comparison experiments demonstrate that LSCHNN outperforms other state-of-the-art methods both in the precision of its ternary FMD association predictions and in the number of potential FMD associations it discovers. Case studies on two microbes further confirm the effectiveness of LSCHNN in identifying potential FMD associations.
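The hypergraph formulation described above, with foods, microbes, and diseases as nodes and each FMD triplet as one hyperedge, can be sketched as a node index plus an incidence matrix. This is only an illustration under stated assumptions: the helper names (`build_hypergraph`, `incidence_matrix`) and the placeholder triplets are hypothetical, and the sketch covers graph construction only, not the LSCHNN network or its contrastive loss.

```python
def build_hypergraph(triplets):
    """Map (food, microbe, disease) triplets to a node index and a hyperedge list.

    Nodes are keyed by (type, name) so that a food and a disease with the
    same name cannot collide. Each hyperedge is the set of the three node
    indices that appear in one triplet.
    """
    node_index = {}
    hyperedges = []
    for food, microbe, disease in triplets:
        edge = []
        for kind, name in (("food", food), ("microbe", microbe), ("disease", disease)):
            key = (kind, name)
            if key not in node_index:
                node_index[key] = len(node_index)
            edge.append(node_index[key])
        hyperedges.append(frozenset(edge))
    return node_index, hyperedges


def incidence_matrix(n_nodes, hyperedges):
    """Return H as a list of rows, with H[v][e] = 1 if node v is in hyperedge e."""
    H = [[0] * len(hyperedges) for _ in range(n_nodes)]
    for e, edge in enumerate(hyperedges):
        for v in edge:
            H[v][e] = 1
    return H


# Placeholder triplets (not real associations): (food, microbe, disease).
triplets = [
    ("F1", "M1", "D1"),
    ("F1", "M2", "D1"),
    ("F2", "M3", "D2"),
]

node_index, hyperedges = build_hypergraph(triplets)
H = incidence_matrix(len(node_index), hyperedges)

# Node degree = number of hyperedges (triplets) a node takes part in.
node_degree = [sum(row) for row in H]
print(len(node_index), "nodes,", len(hyperedges), "hyperedges")
print("degree of F1:", node_degree[node_index[("food", "F1")]])
```

Every column of `H` sums to three because each hyperedge joins exactly one food, one microbe, and one disease. An incidence matrix of this kind is a common input to hypergraph neural network layers, where node attributes would be propagated through it.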
Conclusions: A novel computational model, LSCHNN, is proposed, marking the first integration of hypergraph neural networks with lightweight single-view contrastive learning for ternary FMD association prediction, providing a groundbreaking framework for precision nutrition and personalized dietary interventions.
Citation: Jianqiang Hu, Mingyi Hu, Yangxiang Wu, Songyao Mu, Dahao Huang, Baolong Wang, Yuchen Gao, Shixin Gu, Jinlin Zhu. A lightweight single-view contrastive learning hypergraph neural network for food-microbe-disease association prediction. BMC Bioinformatics 26(1):273. Pub Date: 2025-11-04. DOI: 10.1186/s12859-025-06283-1. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12584493/pdf/