Pub Date: 2025-11-25 DOI: 10.1186/s12859-025-06215-z
Amr Mohamed, Kevin H Lee
As data complexity and volume increase rapidly, efficient statistical methods for identifying significant variables become crucial. Variable selection plays a vital role in establishing relationships between predictors and response variables. The challenge lies in achieving this goal while controlling the False Discovery Rate (FDR) and maintaining statistical power. The knockoff filter, a recent approach, generates inexpensive knockoff variables that mimic the correlation structure of the original variables, serving as negative controls for inference. In this study, we extend the use of knockoffs to Light Gradient Boosting Machine (LightGBM), a fast and accurate machine learning technique. Shapley Additive Explanations (SHAP) values are employed to interpret the black-box nature of machine learning. Through extensive experimentation, our proposed method outperforms traditional approaches, accurately identifying important variables for each class. It offers improved speed and efficiency across multiple datasets. To validate our approach, an extensive simulation study is conducted. The integration of knockoffs into LightGBM enhances performance and interpretability, contributing to the advancement of variable selection methods. Our research addresses the challenges of variable selection in the era of big data, providing a valuable tool for identifying relevant variables in statistical modeling and machine learning applications.
Title: Gradient boosting with knockoff filters: a biostatistical approach to variable selection. Journal: BMC Bioinformatics, p. 13. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12801829/pdf/
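The selection step of a knockoff filter, as described in the abstract above, can be sketched as follows. This is a minimal illustration that assumes the widely used knockoff+ threshold; the abstract does not state which threshold or feature statistic the authors use. The importance scores here are hypothetical placeholders (for example, mean absolute SHAP values from a LightGBM model fitted on the original features alongside their knockoff copies), and the function names are illustrative only.

```python
import numpy as np

def knockoff_statistics(imp_original, imp_knockoff):
    """W_j = importance of feature j minus importance of its knockoff copy.

    Large positive W_j suggests the real feature matters more than its
    negative control. The inputs are hypothetical importance scores.
    """
    return np.asarray(imp_original, dtype=float) - np.asarray(imp_knockoff, dtype=float)

def knockoff_threshold(W, q=0.1, offset=1):
    """Smallest t such that (offset + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    offset=1 gives the knockoff+ rule. Returns infinity when no threshold
    reaches the target FDR level q.
    """
    W = np.asarray(W, dtype=float)
    candidates = np.unique(np.abs(W[W != 0]))  # sorted ascending
    for t in candidates:
        fdp_hat = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf

def select_features(W, q=0.1):
    """Indices of features whose statistic clears the knockoff+ threshold."""
    W = np.asarray(W, dtype=float)
    return np.flatnonzero(W >= knockoff_threshold(W, q))

# Hypothetical statistics for ten features: six strong, four near noise level.
W_demo = np.array([4.9, 3.8, 2.9, 2.2, 1.8, 1.4, -0.3, 0.1, -0.25, 0.15])
selected_demo = select_features(W_demo, q=0.2)
```

With few features, a strict target such as q = 0.1 can select nothing, because the offset in the numerator requires at least ten positive statistics before the estimated false discovery proportion can drop to 0.1.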
Pub Date: 2025-11-24 DOI: 10.1186/s12859-025-06324-9
Peng Shen, Yiyu Lin, Sen Yang, Ziding Zhang
Background: Accurately identifying RNA 2'-O-methylation (2OM) sites is a crucial step in gaining an in-depth understanding of RNA regulatory mechanisms. Although there are currently multiple prediction tools available, they still suffer from limited prediction accuracy and an inability to fully capture the associations between sequences and sites.
Results: This study constructs a novel low-redundancy dataset and innovatively proposes the KN-PairMatrix encoding scheme, effectively addressing the research gap in sequence-site association analysis. Based on this foundation, we developed the deep learning framework OMetaNet, which integrates residual and downsampling-optimized CNN modules, Mamba network, and a proprietary cross-modal interactive fusion module. The framework incorporates a contrastive learning-driven adaptive hybrid loss function. Employing a progressive feature disentanglement strategy, it enhances the learning capability for 2OM site-specific patterns. Independent evaluation results demonstrate that OMetaNet significantly outperforms existing methods in predicting 2OM sites across all four nucleotide types.
Conclusions: We propose a novel computational model, OMetaNet. Its design may reshape the paradigm of transcriptome analysis, open up new directions for extracting modification site information, and shows significant potential in biomarker research and cross-species generalization studies.
Title: OMetaNet: an efficient hybrid deep learning model based on multimodal data fusion and contrastive learning for predicting 2'-O-methylation sites in human RNA. Journal: BMC Bioinformatics, p. 304. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12752101/pdf/
Pub Date: 2025-11-23 DOI: 10.1186/s12859-025-06299-7
El Hacene Djaout, Nicolas Cluzel, Vincent Marechal, Gregory Nuel, Marie Courbariaux
Title: Varaps: a python package for estimating SARS-CoV-2 lineages proportions from pooled sequencing data (ANRS0160). Journal: BMC Bioinformatics, p. 302. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12751500/pdf/
Background: Protein-protein interactions regulate the dynamic operation of intracellular molecular networks, serving as the molecular basis for revealing protein functions and disease mechanisms. Recently, several computational methods for predicting protein-protein interaction sites (PPIs) have been presented as alternatives to costly and labor-intensive traditional experiments. However, existing methods generally ignore the inherent hierarchical structure of protein chains. Furthermore, the equivariance of graph structure during spatial transformations is often neglected when applying graph neural networks to modeling. Therefore, accurately identifying PPIs remains a challenging task.
Results: In this work, we propose an end-to-end GNN-based computational method, EGCPPIS, for efficiently identifying protein-protein interaction sites. First, we construct a hierarchical graph representation of the protein chain, comprising a residue-level graph and an atom-level graph. Next, EGCPPIS uses an E(n) Equivariant Graph Neural Network (EGNN) module to learn residue-level embeddings with equivariant features. After extracting atom-level embeddings with a GraphSAGE module, we introduce a contrastive learning strategy to integrate the hierarchical graph features. This strategy enables us to learn consistent embeddings between residue-level and atom-level representations. Finally, the fused embeddings are weighted using an improved gated multi-head attention mechanism.
Conclusion: Comprehensive evaluation results on multiple datasets demonstrate that EGCPPIS significantly outperforms state-of-the-art methods. Extensive comparative experiments and case studies further confirm that EGCPPIS can reveal the decision-making patterns in PPIs prediction, facilitating the discovery of potential PPIs. The original datasets and code of EGCPPIS are available at https://github.com/GuicongSun/EGCPPIS.
Title: EGCPPIS: learning hierarchical equivariant graph representations with contrastive integration for protein-protein interaction site identification. Authors: Guicong Sun, Yongxian Fan, Yangfeng Zhu, Mengxin Zheng. Pub Date: 2025-11-23 DOI: 10.1186/s12859-025-06328-5. Journal: BMC Bioinformatics, p. 303. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12751822/pdf/
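The equivariance property that the EGCPPIS abstract relies on can be illustrated with a toy E(n)-equivariant message-passing layer. This is a sketch under stated assumptions, not the authors' implementation: it uses a fully connected graph, single linear maps with tanh in place of MLPs, and random weights, and the names `make_layer` and `egnn_layer` are hypothetical. The point is only that node features stay invariant and coordinates transform consistently when the input structure is rotated and translated.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(dim_h, dim_m):
    """Random weights for one toy layer (sizes are illustrative assumptions)."""
    return {
        "We": rng.normal(size=(2 * dim_h + 1, dim_m)) * 0.1,  # edge message map
        "Wx": rng.normal(size=(dim_m, 1)) * 0.1,              # scalar coordinate weight per edge
        "Wh": rng.normal(size=(dim_h + dim_m, dim_h)) * 0.1,  # node feature update
    }

def egnn_layer(h, x, params):
    """One EGNN-style update on a fully connected graph.

    h: (n, dim_h) node features, x: (n, d) coordinates.
    Messages depend on coordinates only through squared distances,
    so h is invariant and x is equivariant under rotations/translations.
    """
    n = h.shape[0]
    diff = x[:, None, :] - x[None, :, :]                  # (n, n, d): x_i - x_j
    dist2 = np.sum(diff ** 2, axis=-1, keepdims=True)     # (n, n, 1): invariant
    hi = np.repeat(h[:, None, :], n, axis=1)              # features of node i
    hj = np.repeat(h[None, :, :], n, axis=0)              # features of node j
    mask = 1.0 - np.eye(n)[:, :, None]                    # drop self-edges
    m = np.tanh(np.concatenate([hi, hj, dist2], axis=-1) @ params["We"]) * mask
    w = np.tanh(m @ params["Wx"])                         # (n, n, 1) edge weights
    x_new = x + np.sum(diff * w, axis=1) / max(n - 1, 1)  # move along relative vectors
    h_new = h + np.tanh(np.concatenate([h, m.sum(axis=1)], axis=-1) @ params["Wh"])
    return h_new, x_new

# Equivariance check on random data with a random orthogonal map Q and shift t.
h0 = rng.normal(size=(5, 4))
x0 = rng.normal(size=(5, 3))
layer = make_layer(4, 8)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
t = rng.normal(size=(1, 3))
h_a, x_a = egnn_layer(h0, x0, layer)
h_b, x_b = egnn_layer(h0, x0 @ Q.T + t, layer)
features_invariant = np.allclose(h_a, h_b)
coords_equivariant = np.allclose(x_a @ Q.T + t, x_b)
```

Because the coordinate update is a weighted sum of relative vectors with weights that depend only on invariant quantities, transforming the input and then applying the layer gives the same result as applying the layer and then transforming the output.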
Pub Date: 2025-11-21 DOI: 10.1186/s12859-025-06314-x
Maryam Mehrabani, Amir Lakizadeh, Alireza Fotuhi Siahpirani, Ali Masoudi-Nejad
Title: SynergyImage: image-based model for drug combinations synergy score prediction. Journal: BMC Bioinformatics 26(1), p. 283. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12639979/pdf/
Pub Date: 2025-11-19 DOI: 10.1186/s12859-025-06306-x
Honglei Wang, Xuesong Zhang, Yanjing Sun, Zhaoyang Liu, Lin Zhang
Background: RNA methylation (RM) regulates gene expression, RNA stability, and protein translation. Accurate prediction of RM modification sites is essential for understanding their biological functions. However, existing wet-lab detection techniques face challenges including operational complexity and high costs. Deep learning (DL) methods have been applied to this task, but their performance degrades with smaller training datasets. For instance, the Bidirectional Gated Recurrent Unit (BGRU) demonstrates substantial performance degradation. Convolutional Neural Networks (CNNs) can extract local pattern features but learn overly specific patterns from sample-limited data, resulting in poor feature generalization. Bidirectional Long Short-Term Memory (BiLSTM) excels at modeling long-range dependencies but cannot sufficiently learn gating mechanism parameters to capture effective sequence representations with limited samples. The Transformer processes sequences in parallel and captures global dependencies through self-attention, but its quadratic computational complexity and large parameter count make it prone to overfitting on small datasets.
Results: This study proposes a Multi-view Contrastive Learning with CNN-BiLSTM-Attention (MCLCBA) framework for RM modification site prediction. The multi-view approach comprises a primary view and auxiliary view, where the primary view utilizes DNA Bidirectional Encoder Representations from Transformers (DNABERT) to extract sequence contextual features, and the auxiliary view employs Chaos Game Representation (CGR) to extract structural features. Feature extraction includes four components: data augmentation, multi-view encoders, projection heads, and contrastive loss functions. By implementing dual differential data augmentation strategies and constructing multi-view network architectures for feature processing and fusion, the model learns discriminative feature representations invariant to data augmentation through maximizing positive sample similarity while minimizing negative sample similarity. This effectively addresses sample-limited feature learning scenarios. Experimental results on the sample-limited m7G dataset demonstrate that MCLCBA achieves AUROC and AUPRC of 85.64% and 86.94%, respectively, improving upon existing methods by 5-6% in both metrics.
Conclusions: Through multi-view contrastive learning, MCLCBA provides an approach for predicting RM sites under sample-limited scenarios.
Title: MCLCBA: multi-view contrastive learning network for RNA methylation site prediction. Journal: BMC Bioinformatics 26(1), p. 281. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12628535/pdf/
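The Chaos Game Representation (CGR) used for the auxiliary view in the MCLCBA abstract can be sketched as follows. This is a minimal illustration assuming the conventional corner layout (A, C, G, U/T at the corners of the unit square) and a hypothetical grid resolution `k`; the abstract does not state the layout, resolution, or how the resulting matrix is fed to the encoder.

```python
import numpy as np

# Conventional CGR corner assignment (an assumption; the abstract gives no layout).
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "U": (1.0, 0.0),
    "T": (1.0, 0.0),  # DNA alphabet maps T to the same corner as U
}

def cgr_points(seq):
    """Return the CGR trajectory: start at the centre, then move halfway
    toward the corner of each successive nucleotide."""
    pos = np.array([0.5, 0.5])
    pts = []
    for base in seq.upper():
        if base not in CORNERS:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        pos = (pos + np.array(CORNERS[base])) / 2.0
        pts.append(pos)
    return np.array(pts).reshape(-1, 2)

def cgr_image(seq, k=3):
    """Count CGR points on a 2^k x 2^k grid.

    After at least k steps, the grid cell of a point is determined by the
    last k nucleotides, so the matrix is a k-mer frequency map.
    """
    size = 2 ** k
    img = np.zeros((size, size))
    for px, py in cgr_points(seq)[k - 1:]:
        img[int(py * size), int(px * size)] += 1
    return img

# Hypothetical short RNA fragment, binned at k = 2 (a 4 x 4 matrix).
demo_image = cgr_image("GGACUUGGACU", k=2)
```

The resulting matrix has a fixed size regardless of sequence length, which is one reason an image-like CGR view is convenient as input to a convolutional encoder.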
Pub Date: 2025-11-19 DOI: 10.1186/s12859-025-06296-w
Qinhuan Luo, Yongzhen Yu, Tianying Wang
Background: Single-cell RNA sequencing (scRNA-seq) provides extensive opportunities to explore cellular heterogeneity but is often limited by substantial technical noise and variability. The prevalence of zero counts, arising from both biological variation and technical dropout events, poses significant challenges for downstream analyses. Existing imputation methods face inherent trade-offs: statistical approaches maintain interpretability but exhibit limited capacity for capturing complex, non-linear gene expression relationships, whereas deep learning methods demonstrate superior flexibility but are prone to overfitting and lack mechanistic interpretability, particularly in settings with limited sample sizes.
Methods: We present ZILLNB (Zero-Inflated Latent factors Learning-based Negative Binomial), a novel computational framework that integrates zero-inflated negative binomial (ZINB) regression with deep generative modeling. ZILLNB employs an ensemble architecture combining Information Variational Autoencoder (InfoVAE) and Generative Adversarial Network (GAN) to learn latent representations at cellular and gene levels. These latent factors serve as dynamic covariates within a ZINB regression framework, with parameters iteratively optimized through an Expectation-Maximization algorithm. This approach enables systematic decomposition of technical variability from intrinsic biological heterogeneity.
Results: Comparative evaluations across multiple scRNA-seq datasets demonstrate ZILLNB's superior performance. In cell type classification tasks using mouse cortex and human PBMC datasets, ZILLNB achieved the highest Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI) among tested methods, with improvements ranging from 0.05 to 0.2 over VIPER, scImpute, DCA, DeepImpute, SAVER, scMultiGAN and ALRA. For differential expression analysis validated against matched bulk RNA-seq data, ZILLNB improved the area under the Receiver Operating Characteristic curve (AUC-ROC) and under the Precision-Recall curve (AUC-PR) by 0.05 to 0.3 compared to standard and other imputation methods, with consistently lower false discovery rates. Application to idiopathic pulmonary fibrosis (IPF) datasets revealed distinct fibroblast subpopulations undergoing fibroblast-to-myofibroblast transition, validated through marker gene expression and pathway enrichment analyses.
Conclusion: ZILLNB provides a principled framework for addressing technical artifacts in scRNA-seq data while preserving biological variation. The integration of statistical modeling with deep learning enables robust performance across common single-cell analysis tasks, including cell type identification, differential expression analysis, and rare cell population discovery.
Title: Denoising single-cell RNA-seq data with a deep learning-embedded statistical framework. Journal: BMC Bioinformatics 26(1), p. 282. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12629073/pdf/
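The zero-inflated negative binomial (ZINB) model at the core of the ZILLNB abstract can be written out in a few lines. This is a sketch assuming the common mean/dispersion parameterization with scalar parameters; in ZILLNB the mean would depend on learned latent factors, which are not reproduced here. The last function shows the kind of quantity an Expectation-Maximization E-step typically computes for zero counts, though the abstract does not spell out the authors' exact update.

```python
import math

def nb_logpmf(y, mu, theta):
    """Log-probability of count y under a negative binomial with mean mu > 0
    and dispersion (size) theta > 0."""
    return (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )

def zinb_logpmf(y, mu, theta, pi):
    """Log-probability under ZINB, where 0 < pi < 1 is the probability of a
    structural (dropout) zero."""
    if y == 0:
        # A zero is either a dropout or a genuine NB zero.
        return math.log(pi + (1.0 - pi) * math.exp(nb_logpmf(0, mu, theta)))
    return math.log(1.0 - pi) + nb_logpmf(y, mu, theta)

def dropout_posterior(mu, theta, pi):
    """P(zero is a dropout | y = 0): the responsibility used in an EM E-step."""
    nb_zero = math.exp(nb_logpmf(0, mu, theta))
    return pi / (pi + (1.0 - pi) * nb_zero)

# Hypothetical parameters: mean expression 2, dispersion 1.5, 30% dropout.
demo_posterior = dropout_posterior(mu=2.0, theta=1.5, pi=0.3)
```

For a gene with high expected expression, a genuine NB zero is unlikely, so the posterior attributes an observed zero mostly to dropout; for a lowly expressed gene the same zero is more plausibly biological.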
Pub Date: 2025-11-18 DOI: 10.1186/s12859-025-06302-1
Jiajia Liu, Surendra S Negi, Chengyuan Yang, Xiaobo Zhou, Catherine H Schein, Werner Braun, Pora Kim
Title: AllergenAI: a deep learning model predicting allergenicity based on protein sequence. Journal: BMC Bioinformatics 26(1), p. 279. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625376/pdf/
Pub Date: 2025-11-18 DOI: 10.1186/s12859-025-06309-8
Nan Sun, Yu Wang, Xiang Shi, Dengcheng Yang, Rongling Wu, Stephen S-T Yau
Accurate cell type classification is critical for downstream analysis in single-cell RNA sequencing (scRNA-seq). Most existing methods rely on a single type of feature representation, such as statistical, information-theoretic, matrix-factorization, or deep-learning-based features. However, each captures different aspects of the data, and no single feature type can fully represent the complex differences between cell types. Moreover, naïvely concatenating multiple features may introduce redundancy or noise, reducing model performance. To address these challenges, we propose scMFF, a multiple-feature fusion framework that integrates four features and explores six fusion strategies in combination with various classifiers for single-cell type classification. Comprehensive evaluations on 42 disease-related datasets and an external COVID-19 dataset demonstrate that scMFF outperforms single-feature approaches in terms of performance and stability, providing a reliable and effective solution for scRNA-seq data analysis.
Title: scMFF: a machine learning framework with multiple feature fusion strategies for cell type identification. Journal: BMC Bioinformatics 26(1), p. 277. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625116/pdf/
Pub Date: 2025-11-18 DOI: 10.1186/s12859-025-06310-1
Lilija Wehling, Gurdeep Singh, Ahmad Wisnu Mulyadi, Rakesh Hadne Sreenath, Henning Hermjakob, Tung V N Nguyen, Thomas Rückle, Mohammed H Mosa, Henrik Cordes, Tommaso Andreani, Thomas Klabunde, Rahuman S Malik Sheriff, Douglas McCloskey
Background: Quantitative kinetic models of biological regulatory processes play an important role in understanding disease mechanisms. However, their simulation and analysis require specialized domain expertise.
Results: In this study, we present Talk2Biomodels (T2B), an open-source, user-friendly agentic AI platform based on large language models, designed to facilitate access to computational models of biological systems and to promote the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles in systems biology. T2B allows users to interact with and analyse mathematical models of biological systems through conversations in natural language, thereby lowering the barrier to entry for model interpretation and hypothesis-driven exploration. The platform natively supports models encoded in the Systems Biology Markup Language, a widely adopted standard in the computational biology community. T2B is integrated with the BioModels database (https://www.ebi.ac.uk/biomodels/), enabling retrieval, simulation, and analysis of curated systems biology models. We illustrate the platform's capabilities through use cases in precision medicine, infectious disease epidemiology, and the study of emergent network-level properties in cellular systems, demonstrating how both computational experts and domain scientists without formal modelling training can derive actionable insights from complex biological models. Talk2Biomodels is available at https://github.com/VirtualPatientEngine/AIAgents4Pharma. Detailed documentation and use cases are available at https://virtualpatientengine.github.io/AIAgents4Pharma/talk2biomodels/intro/.
Conclusions: In summary, T2B lowers the barrier for non-experts to engage with and extract insights from computational models of biological systems, provides experts with a streamlined interface for analysing models, and contributes overall to the FAIRification of models.
Title: Talk2Biomodels: AI agent-based open-source LLM initiative for kinetic biological models. Authors: Lilija Wehling, Gurdeep Singh, Ahmad Wisnu Mulyadi, Rakesh Hadne Sreenath, Henning Hermjakob, Tung V N Nguyen, Thomas Rückle, Mohammed H Mosa, Henrik Cordes, Tommaso Andreani, Thomas Klabunde, Rahuman S Malik Sheriff, Douglas McCloskey. Journal: BMC Bioinformatics 26(1):276. Publication date: 2025-11-18. DOI: 10.1186/s12859-025-06310-1. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625589/pdf/
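To make concrete what "simulating a kinetic model" involves, here is a minimal sketch of the kind of computation that sits behind such a request. It does not use the T2B API, does not read SBML, and does not contact the BioModels database. The two-species model, its rate constants, and the function names (`make_model`, `rk4_step`, `simulate`) are assumptions chosen for illustration. A substrate S is converted to a product P by mass action, P is degraded, and the system is integrated with a classical fourth-order Runge-Kutta scheme.

```python
def make_model(k_conv, k_deg):
    """Return the right-hand side of a toy two-species mass-action model.

    dS/dt = -k_conv * S
    dP/dt =  k_conv * S - k_deg * P
    """
    def rhs(t, y):
        s, p = y
        return [-k_conv * s, k_conv * s - k_deg * p]
    return rhs


def rk4_step(f, t, y, h):
    """Advance the state y by one classical Runge-Kutta step of size h."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [
        yi + h / 6 * (a + 2 * b + 2 * c + d)
        for yi, a, b, c, d in zip(y, k1, k2, k3, k4)
    ]


def simulate(f, y0, t_end, h):
    """Integrate from t = 0 to t_end; return a list of (time, state) pairs."""
    steps = round(t_end / h)
    y = list(y0)
    trajectory = [(0.0, list(y))]
    for i in range(steps):
        y = rk4_step(f, i * h, y, h)
        trajectory.append(((i + 1) * h, list(y)))
    return trajectory


# Hypothetical parameters: conversion rate 0.5, degradation rate 0.1,
# starting from S = 1.0 and P = 0.0.
trajectory = simulate(make_model(0.5, 0.1), [1.0, 0.0], t_end=2.0, h=0.01)
final_time, (final_s, final_p) = trajectory[-1]
```

Because S follows dS/dt = -k_conv * S, its value at time t should match exp(-k_conv * t), which gives a quick sanity check on the integrator. Dedicated SBML simulators handle far richer models (events, rules, compartments) than this hand-written example.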