Pub Date: 2025-12-29 DOI: 10.1186/s12859-025-06325-8
Alisson Silva, Carlos Marquez, Iury Godoy, Lucas Silva, Matheus Prado, Murilo Beppler, Natanael Avila, Bruno Travençolo, Anderson R Santos
Background: Computational prediction of protein-protein interactions (PPIs) is crucial for understanding cell biology and drug development, offering an alternative to costly experimental methods. The original GenPPi software advanced ab initio PPI network prediction from bacterial genomes but was limited by its reliance on high sequence similarity. This work introduces GenPPi 1.5 to enhance these predictive capabilities.
Results: GenPPi 1.5 incorporates a Random Forest (RF) algorithm, trained on 60 biophysical features from amino acid propensity indices, to classify protein similarity even in low sequence identity scenarios (targeting >65% identity). To manage computational complexity from the increased interactions generated by the RF model, especially in extensive conserved phylogenetic profiles, we developed and integrated the Reduced Interaction Sampling (RIS) algorithm. RIS stochastically samples interactions within these profiles, optimizing performance for complete genome analysis. Extensive simulations across various configurations validated the methodology. RF integration significantly broadened GenPPi's predictive power; application to Buchnera aphidicola showed up to 62% overlap with STRING database interactions. Analysis of RIS demonstrated that while introducing some randomness, critical node identification remains robust, particularly for Top_N values ≥ 100, indicating minimal compromise to network integrity.
Conclusion: The combination of Machine Learning (RF) and the RIS algorithm in GenPPi 1.5 represents a significant advancement. It overcomes the high-similarity dependency of the previous version while efficiently handling complex genomes. GenPPi 1.5 provides a robust and scalable alignment-free PPI prediction solution, enabling users to train custom models tailored to specific genomic contexts. GenPPi is freely available on our website https://genppi.facom.ufu.br/ , its source code is hosted on GitHub https://github.com/santosardr/genppi , and it can be easily installed via the Python Package Index using the command pip install genppi-py.
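The idea of stochastically sampling interactions inside a large conserved profile can be illustrated with a minimal sketch. This is not the GenPPi implementation: the function name `sample_interactions`, the `max_pairs` cap, and the uniform sampling scheme are all assumptions made for illustration, since the abstract does not specify how RIS draws its sample.

```python
import itertools
import random


def sample_interactions(proteins, max_pairs, seed=0):
    """Return at most `max_pairs` protein pairs from one profile.

    Hypothetical illustration only: every pair is kept when the profile is
    small, otherwise a uniform random subset of the pairs is drawn.
    """
    # All candidate interactions are the unordered pairs within the profile.
    pairs = list(itertools.combinations(sorted(proteins), 2))
    if len(pairs) <= max_pairs:
        return pairs
    # A seeded generator keeps the sampled network reproducible across runs.
    rng = random.Random(seed)
    return rng.sample(pairs, max_pairs)


profile = [f"prot{i}" for i in range(50)]      # 50 proteins -> 1225 possible pairs
sampled = sample_interactions(profile, max_pairs=100)
```

Capping the number of pairs keeps the cost per profile bounded, at the price of the run-to-run randomness that the abstract reports as having little effect on critical node identification for Top_N ≥ 100.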
{"title":"Improving protein interaction prediction in GenPPi: a novel interaction sampling approach preserving network topology.","authors":"Alisson Silva, Carlos Marquez, Iury Godoy, Lucas Silva, Matheus Prado, Murilo Beppler, Natanael Avila, Bruno Travençolo, Anderson R Santos","doi":"10.1186/s12859-025-06325-8","DOIUrl":"10.1186/s12859-025-06325-8","url":null,"abstract":"<p><strong>Background: </strong>Computational prediction of protein-protein interactions (PPIs) is crucial for understanding cell biology and drug development, offering an alternative to costly experimental methods. The original GenPPi software advanced ab initio PPI network prediction from bacterial genomes but was limited by its reliance on high sequence similarity. This work introduces GenPPi 1.5 to enhance these predictive capabilities.</p><p><strong>Results: </strong>GenPPi 1.5 incorporates a Random Forest (RF) algorithm, trained on 60 biophysical features from amino acid propensity indices, to classify protein similarity even in low sequence identity scenarios (targeting >65% identity). To manage computational complexity from the increased interactions generated by the RF model, especially in extensive conserved phylogenetic profiles, we developed and integrated the Reduced Interaction Sampling (RIS) algorithm. RIS stochastically samples interactions within these profiles, optimizing performance for complete genome analysis. Extensive simulations across various configurations validated the methodology. RF integration significantly broadened GenPPi's predictive power; application to Buchnera aphidicola showed up to 62% overlap with STRING database interactions. 
Analysis of RIS demonstrated that while introducing some randomness, critical node identification remains robust, particularly for Top_N values ≥ 100, indicating minimal compromise to network integrity.</p><p><strong>Conclusion: </strong>The combination of Machine Learning (RF) and the RIS algorithm in GenPPi 1.5 represents a significant advancement. It overcomes the high-similarity dependency of the previous version while efficiently handling complex genomes. GenPPi 1.5 provides a robust and scalable alignment-free PPI prediction solution, enabling users to train custom models tailored to specific genomic contexts. GenPPi is freely available on our website https://genppi.facom.ufu.br/ , its source code is hosted on GitHub https://github.com/santosardr/genppi , and it can be easily installed via the Python Package Index using the command pip install genppi-py.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"296"},"PeriodicalIF":3.3,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12751606/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145854196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-29 DOI: 10.1186/s12859-025-06316-9
Matthew Massett, Adrian Carr
Background: Protein Language Models (PLMs) are emerging as powerful tools for designing human proteins, including antibodies. These models can predict the effects of mutations in a zero-shot setting, that is, without additional fine-tuning, and suggest plausible amino acid substitutions.
Results: We introduce Protein Diversification and Generation through Yielded Mutations (Prodigy) Protein, which provides several DirectedEvolution classes that introduce amino acid substitutions in a stepwise manner. Each substitution is evaluated using one of two scoring strategies, and the most promising candidates are sampled accordingly. Users can customize the number of evolution steps, specify target regions within the protein sequence, and set score thresholds to filter out low-quality substitutions during the design process.
Conclusion: Protein Diversification and Generation through Yielded Mutations (Prodigy) Protein is a fast and flexible tool for in silico protein design. It introduces a consistent and efficient probabilistic framework that leverages any masked language modeling Protein Language Model (PLM) available via Hugging Face. Unlike existing tools, Prodigy Protein can integrate multiple PLMs to design protein variants, an approach not currently supported by other publicly available software.
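The stepwise loop described in the Results (score each substitution, filter by a threshold, sample among the best candidates, restrict to a target region) might look roughly like the sketch below. It does not use the Prodigy Protein API: `evolve`, `toy_score`, and all parameter names are hypothetical, and the toy scorer stands in for a real PLM-based score.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def evolve(sequence, score_fn, steps=3, region=None, threshold=0.0, top_k=5, seed=0):
    """Introduce up to `steps` single substitutions, one per step (illustrative only)."""
    rng = random.Random(seed)
    start, end = region if region is not None else (0, len(sequence))
    current = sequence
    for _ in range(steps):
        # Score every single substitution inside the target region.
        candidates = []
        for pos in range(start, end):
            for aa in AMINO_ACIDS:
                if aa != current[pos]:
                    score = score_fn(current, pos, aa)
                    if score >= threshold:          # drop low-quality substitutions
                        candidates.append((score, pos, aa))
        if not candidates:
            break
        # Keep the best-scoring candidates and sample one of them.
        candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
        _, pos, aa = rng.choice(candidates[:top_k])
        current = current[:pos] + aa + current[pos + 1:]
    return current


def toy_score(seq, pos, aa):
    # Hypothetical stand-in for a PLM score: favours hydrophobic residues.
    return 1.0 if aa in "AILMFVW" else -1.0


variant = evolve("MKTAYIAK", toy_score, steps=2, region=(2, 5))
```

In a real setting `score_fn` would wrap a masked language model; the threshold and region arguments mirror the user controls listed in the abstract.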
{"title":"Prodigy protein: Python package for zero-shot protein engineering using protein language models.","authors":"Matthew Massett, Adrian Carr","doi":"10.1186/s12859-025-06316-9","DOIUrl":"10.1186/s12859-025-06316-9","url":null,"abstract":"<p><strong>Background: </strong>Protein Language Models (PLMs) are emerging as powerful tools for designing human proteins, including antibodies. These models can predict the effects of mutations in a zero-shot setting-without requiring additional fine-tuning-and suggest plausible amino acid substitutions.</p><p><strong>Results: </strong>We introduce Protein Diversification and Generation through Yielded Mutations (Prodigy) Protein which provides several DirectedEvolution classes that introduce amino acid substitutions in a stepwise manner. Each substitution is evaluated using one of two scoring strategies, and the most promising candidates are sampled accordingly. Users can customize the number of evolution steps, specify target regions within the protein sequence, and set score thresholds to filter out low-quality substitutions during the design process.</p><p><strong>Conclusion: </strong>Protein Diversification and Generation through Yielded Mutations (Prodigy) Protein is a fast and flexible tool for in silico protein design. It introduces a consistent and efficient probabilistic framework that leverages any masked language modeling Protein Language Model (PLM) available via Hugging Face. 
Unlike existing tools, Prodigy Protein can integrate multiple PLMs to design protein variants-an approach not currently supported by other publicly available software.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"298"},"PeriodicalIF":3.3,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12751917/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145854235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-29 DOI: 10.1186/s12859-025-06317-8
Xianyong Zhou, Xindian Wei, Cheng Liu, Wenjun Shen, Ping Xuan, Si Wu, Hau-San Wong
Single-cell RNA sequencing (scRNA-seq) technology has transformed gene expression studies by enabling analysis at the individual cell level, offering unprecedented insights into cellular heterogeneity. A key challenge in scRNA-seq data analysis is cell type identification, which requires grouping cells with similar gene expression profiles using unsupervised clustering methods. However, the high dimensionality, inherent noise, and significant sparsity of scRNA-seq data present substantial obstacles to accurately determining relationships among cell samples. To address these challenges, we propose a novel deep subspace clustering approach for cell type identification that captures a more reliable subspace structure from scRNA-seq data. Our method leverages a robust self-representation learning framework to effectively characterize and learn the underlying cluster structure. This framework is optimized through an integrated strategy combining a structure-guided approach with an optimal transport algorithm, enhancing the robustness of the subspace clustering process. By mitigating the effects of noise and sparsity in scRNA-seq data, this approach enables more accurate cell clustering. Experimental results on 18 real scRNA-seq datasets demonstrate that our method outperforms several state-of-the-art clustering approaches tailored for scRNA-seq data, excelling in both accuracy and interpretability.
{"title":"Robust subspace structure discovery for cell type identification in scRNA-seq data.","authors":"Xianyong Zhou, Xindian Wei, Cheng Liu, Wenjun Shen, Ping Xuan, Si Wu, Hau-San Wong","doi":"10.1186/s12859-025-06317-8","DOIUrl":"10.1186/s12859-025-06317-8","url":null,"abstract":"<p><p>Single-cell RNA sequencing (scRNA-seq) technology has transformed gene expression studies by enabling analysis at the individual cell level, offering unprecedented insights into cellular heterogeneity. A key challenge in scRNA-seq data analysis is cell type identification, which requires grouping cells with similar gene expression profiles using unsupervised clustering methods. However, the high dimensionality, inherent noise, and significant sparsity of scRNA-seq data present substantial obstacles to accurately determining relationships among cell samples. To address these challenges, we propose a novel deep subspace clustering approach for cell type identification that captures a more reliable subspace structure from scRNA-seq data. Our method leverages a robust self-representation learning framework to effectively characterize and learn the underlying cluster structure. This framework is optimized through an integrated strategy combining a structure-guided approach with an optimal transport algorithm, enhancing the robustness of the subspace clustering process. By mitigating the effects of noise and sparsity in scRNA-seq data, this approach enables more accurate cell clustering. 
Experimental results on 18 real scRNA-seq datasets demonstrate that our method outperforms several state-of-the-art clustering approaches tailored for scRNA-seq data, excelling in both accuracy and interpretability.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"295"},"PeriodicalIF":3.3,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12752283/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145854162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-29 DOI: 10.1186/s12859-025-06318-7
Junrong Song, Yuanli Gong, Zhiming Song, Xinggui Xu, Kun Qian, Yingbo Liu
Background: Cancer's complexity and heterogeneity pose significant challenges for personalized treatment. Accurate classification of patients into molecular subtypes is critical for targeted therapy and improved outcomes. However, existing methods often fail to simultaneously capture inter-patient heterogeneity and shared molecular patterns in driver gene profiles.
Results: To address this limitation, we propose DriverSub-SVM, a novel framework for interpretable cancer subtype classification that integrates patient-specific and cohort-wide driver gene information. Our method first models the bidirectional influence between mutated and dysregulated genes via a random walk on a functional interaction network. It then applies Bayesian Personalized Ranking (BPR) to infer personalized driver gene rankings for each patient. These rankings are aggregated into a consensus driver gene set using the Condorcet method. Subsequently, a One-Against-One Multiclass Support Vector Machine (OAO-MSVM) classifies patients based on their gene-level profiles. Evaluated on multiple TCGA datasets, DriverSub-SVM outperformed four state-of-the-art methods, achieving higher accuracy and identifying clinically relevant genes associated with prognosis and therapeutic response.
Conclusion: DriverSub-SVM offers an effective and interpretable approach for cancer subtype classification by bridging individual heterogeneity and population-level patterns. It enhances understanding of tumor biology and holds promise for precision oncology and biomarker discovery. The source code is available at https://github.com/sjunrong/DriverSub-SVM .
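To make the rank-aggregation step concrete, here is a minimal sketch of a Condorcet-style consensus over per-patient gene rankings. The abstract names only the Condorcet approach, so the specific rule used here (ordering genes by the number of pairwise majority contests they win, often called Copeland scoring) is an assumption, as are the function name `condorcet_order` and the example gene lists.

```python
from itertools import combinations


def condorcet_order(rankings):
    """Order genes by how many pairwise majority contests they win."""
    genes = sorted(set().union(*rankings))
    # Position of each gene in each patient's ranking (lower is better).
    positions = [{g: i for i, g in enumerate(r)} for r in rankings]
    wins = {g: 0 for g in genes}
    for a, b in combinations(genes, 2):
        a_votes = b_votes = 0
        for pos in positions:
            # A gene missing from a ranking is treated as ranked last.
            pa, pb = pos.get(a, len(pos)), pos.get(b, len(pos))
            if pa < pb:
                a_votes += 1
            elif pb < pa:
                b_votes += 1
        if a_votes > b_votes:
            wins[a] += 1
        elif b_votes > a_votes:
            wins[b] += 1
    # Ties in win count are broken alphabetically for determinism.
    return sorted(genes, key=lambda g: (-wins[g], g))


patient_rankings = [
    ["TP53", "KRAS", "EGFR"],
    ["TP53", "EGFR", "KRAS"],
    ["KRAS", "TP53", "EGFR"],
]
consensus = condorcet_order(patient_rankings)
```

Here TP53 beats both other genes in a majority of rankings, so it heads the consensus list; a consensus driver set could then be taken as the top entries of that list.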
{"title":"DriverSub-SVM: a machine learning approach for cancer subtype classification by integrating patient-specific and global driver genes.","authors":"Junrong Song, Yuanli Gong, Zhiming Song, Xinggui Xu, Kun Qian, Yingbo Liu","doi":"10.1186/s12859-025-06318-7","DOIUrl":"10.1186/s12859-025-06318-7","url":null,"abstract":"<p><strong>Background: </strong>Cancer's complexity and heterogeneity pose significant challenges for personalized treatment. Accurate classification of patients into molecular subtypes is critical for targeted therapy and improved outcomes. However, existing methods often fail to simultaneously capture inter-patient heterogeneity and shared molecular patterns in driver gene profiles.</p><p><strong>Results: </strong>To address this limitation, we propose DriverSub-SVM, a novel framework for interpretable cancer subtype classification that integrates patient-specific and cohort-wide driver gene information. Our method first models the bidirectional influence between mutated and dysregulated genes via a random walk on a functional interaction network. It then applies Bayesian Personalized Ranking (BPR) to infer personalized driver gene rankings for each patient. These rankings are aggregated into a consensus driver gene set using the Condorcet. Subsequently, a One-Against-One Multiclass Support Vector Machine (OAO-MSVM) classifies patients based on their gene-level profiles. Evaluated on multiple TCGA datasets, DriverSub-SVM outperformed four state-of-the-art methods, achieving higher accuracy and identifying clinically relevant genes associated with prognosis and therapeutic response.</p><p><strong>Conclusion: </strong>DriverSub-SVM offers an effective and interpretable approach for cancer subtype classification by bridging individual heterogeneity and population-level patterns. It enhances understanding of tumor biology and holds promise for precision oncology and biomarker discovery. 
The source code is available at https://github.com/sjunrong/DriverSub-SVM .</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"297"},"PeriodicalIF":3.3,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12751776/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145854201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-29 DOI: 10.1186/s12859-025-06336-5
Yiming Ma
Background: DNA data storage offers exceptional density and longevity, but its practicality is hampered by the high cost and low throughput of de novo DNA synthesis. A key cost driver in array-based synthesis is the length of a common supersequence required to encode a batch of DNA strands.
Objective: This study aims to address this cost bottleneck by investigating the optimal batch partitioning of DNA sequences. Our goal is to minimize the total synthesis cost, which is defined as the sum of the lengths of the shortest common supersequences (SCS) across all batches.
Results: Given a large pool [Formula: see text] of balanced binary sequences, which is partitioned into k batches with almost equal size, we define the total cost of [Formula: see text] to be the sum of lengths of the shortest common supersequence (SCS) of all sequences in each batch. The central problem is to determine the minimum total cost of [Formula: see text], denoted by [Formula: see text], among all partitions into k batches.
Conclusions: When [Formula: see text] is the set of all balanced binary sequences of length 2n, we use combinatorial methods to obtain [Formula: see text] for any positive n, and [Formula: see text] for [Formula: see text] and large n with C a constant depending on k. Similarly, we get [Formula: see text] for [Formula: see text] and large n when [Formula: see text] is the set of all balanced DNA sequences of length 2n. Previously, the probabilistic model of this problem was studied by Makarychev et al. (IEEE Trans Inf Theory 68:7454-7470, 2022), where strings are unconstrained or without consecutive identical letters.
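The cost being minimized can be made concrete with a small sketch that computes the exact SCS length of a batch and sums it over a partition. It is an illustration of the definition only, not the paper's combinatorial method: the function names are assumptions, and the exact search is exponential in the batch size, so it is only practical for tiny examples.

```python
from functools import lru_cache


def scs_length(seqs):
    """Exact length of a shortest common supersequence of the given strings."""
    seqs = tuple(seqs)
    alphabet = sorted(set("".join(seqs)))

    @lru_cache(maxsize=None)
    def solve(pos):
        # pos[i] counts how many characters of seqs[i] are already covered.
        if all(p == len(s) for p, s in zip(pos, seqs)):
            return 0
        best = None
        for ch in alphabet:
            # Appending `ch` advances every sequence whose next character is `ch`.
            nxt = tuple(
                p + 1 if p < len(s) and s[p] == ch else p
                for p, s in zip(pos, seqs)
            )
            if nxt != pos:
                cost = 1 + solve(nxt)
                if best is None or cost < best:
                    best = cost
        return best

    return solve(tuple(0 for _ in seqs))


def total_cost(batches):
    """Sum of SCS lengths over all batches of a partition."""
    return sum(scs_length(batch) for batch in batches)


# Two ways of splitting four balanced binary sequences of length 4 into two batches.
cost_a = total_cost([["0011", "1100"], ["0101", "1010"]])
cost_b = total_cost([["0011", "0101"], ["1100", "1010"]])
```

For two strings the SCS length equals the sum of their lengths minus the length of their longest common subsequence, which gives a quick check on small cases. The two partitions above have different total costs, which is exactly why the choice of partition matters.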
{"title":"Batch optimization for balanced binary sequences and DNA sequences.","authors":"Yiming Ma","doi":"10.1186/s12859-025-06336-5","DOIUrl":"10.1186/s12859-025-06336-5","url":null,"abstract":"<p><strong>Background: </strong>DNA data storage offers exceptional density and longevity, but its practicality is hampered by the high cost and low throughput of de novo DNA synthesis. A key cost driver in array-based synthesis is the length of a common supersequence required to encode a batch of DNA strands.</p><p><strong>Objective: </strong>This study aims to address this cost bottleneck by investigating the optimal batch partitioning of DNA sequences. Our goal is to minimize the total synthesis cost, which is defined as the sum of the lengths of the shortest common supersequences (SCS) across all batches.</p><p><strong>Results: </strong>Given a large pool [Formula: see text] of balanced binary sequences, which is partitioned into k batches with almost equal size, we define the total cost of [Formula: see text] to be the sum of lengths of the shortest common supersequence (SCS) of all sequences in each batch. The central problem is to determine the minimum total cost of [Formula: see text], denoted by [Formula: see text], among all partitions into k batches.</p><p><strong>Conclusions: </strong>When [Formula: see text] is the set of all balanced binary sequences of length 2n, we use combinatorial methods to obtain [Formula: see text] for any positive n, and [Formula: see text] for [Formula: see text] and large n with C a constant depending on k. Similarly, we get [Formula: see text] for [Formula: see text] and large n when [Formula: see text] is the set of all balanced DNA sequences of length 2n. Previously, the probabilistic model of this problem was studied by Makarychev et al. 
(IEEE Trans Inf Theory 68:7454-7470, 2022), where strings are unconstrained or without consecutive identical letters.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"300"},"PeriodicalIF":3.3,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12750718/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145854153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Cancer is a complex disease that arises from simultaneous mutations in multiple biological molecules. An effective therapeutic strategy is to exploit synthetic lethality (SL) by targeting the SL partners of cancer driver genes. Computational approaches have emerged as efficient complements to traditional methods. Although some methods integrate heterogeneous sources to learn multi-network representations, they often neglect both the consistent information shared across networks and the characteristics specific to each individual network. Therefore, a comprehensive representation learning framework that captures both multi-network consistency and network-specific information for gene pairs is needed.
Results: We propose a novel approach that captures Multi-network consistent and specific representations with a Generative Adversarial Network for Synthetic Lethality prediction (MGANSL). MGANSL employs network-aligned and network-specific encoding modules to cooperatively learn comprehensive multi-network representations of gene pairs. In particular, the network-aligned encoding module captures cross-modal consistent information via cross-network adversarial generation, and the network-specific encoding module captures information specific to a single network via intra-network adversarial generation.
Conclusions: Comprehensive experiments conducted on two human synthetic lethality datasets demonstrate the superiority of the proposed method in SL prediction. Moreover, the newly predicted SL associations could aid in designing anti-cancer drugs and in identifying potential drug targets.
{"title":"MGANSL: multi-network representation generating with generative adversarial network for synthetic lethality prediction.","authors":"Jinxin Li, Xinguo Lu, Zihao Li, Xing Liu, Hongrui Liu, Jingjing Ruan","doi":"10.1186/s12859-025-06345-4","DOIUrl":"10.1186/s12859-025-06345-4","url":null,"abstract":"<p><strong>Background: </strong>Cancer is a complex disease that arises from the simultaneous mutations of multiple biological molecules. An effective therapeutic strategy is to exploit synthetic lethality (SL) by targeting the SL partner of cancer driver genes. Computational approaches have emerged as efficient complements to traditional methods. Although some methods integrate heterogeneous sources to learn multi-network representations, they often neglect consistent information shared across different networks and specific characteristic specific to individual network. Therefore, a comprehensive representation learning framework for capturing both multi-network consistency and network-specific information of gene pair is needed.</p><p><strong>Results: </strong>We proposed a novel approach capturing Multi-network consistent and specific representation with Generative Adversarial Network for Synthetic Lethality prediction (MGANSL). MGANSL employs network-aligned and network-specific encoding modules to cooperatively learn comprehensive multi-network representations of gene pair. In particular, network-aligned encoding module can capture cross-modal consistent information via cross-network adversarial generation, and network-specific encoding module can capture single network specific information via intra-network adversarial generation.</p><p><strong>Conclusions: </strong>Comprehensive experiments conducted on two human synthetic lethality datasets demonstrate the superiority of proposed method in SL prediction. 
Moreover, the novel predicted SL associations could aid in designing anti-cancer drugs and providing potential drug targets.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":"27"},"PeriodicalIF":3.3,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12860022/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145854190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-27 DOI: 10.1186/s12859-025-06344-5
Daniel Zyss, Amritansh Sharma, Susana A Ribeiro, Claire E Repellin, Oliver Lai, Mary J C Ludlam, Thomas Walter, Amin Fehri
{"title":"Contrastive learning for cell division detection and tracking in live cell imaging data.","authors":"Daniel Zyss, Amritansh Sharma, Susana A Ribeiro, Claire E Repellin, Oliver Lai, Mary J C Ludlam, Thomas Walter, Amin Fehri","doi":"10.1186/s12859-025-06344-5","DOIUrl":"10.1186/s12859-025-06344-5","url":null,"abstract":"","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":"30"},"PeriodicalIF":3.3,"publicationDate":"2025-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12859858/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145846402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advances in metagenomic sequencing have increasingly implicated gut microbiome dysbiosis in numerous complex diseases, yet its application for precise differential diagnosis remains a major challenge. Existing computational approaches often show limited predictive performance and insufficient robustness when applied to large-scale, imbalanced microbiome datasets, and they typically lack mechanisms to effectively capture microbial community-level or functional guild interactions. To address these limitations, we developed AR-CDT Net, a novel deep learning framework that integrates a Multi-Scale Deformable Convolution (MS-DConv) module with a Channel-wise Dynamic Tanh (CD-Tanh) activation function to achieve more accurate and robust classification of host disease states. Evaluated on a large-scale cohort comprising over 8000 samples spanning eight disease phenotypes, AR-CDT Net demonstrated highly competitive within-cohort performance, outperforming nine representative models across the majority of classification tasks. Importantly, in a stringent cross-dataset generalization test, the model was trained on the highly imbalanced primary multi-disease cohort and validated on relatively balanced independent external cohorts. It achieved a statistically significant AUC of 0.7921 on the highly heterogeneous external T2D cohort, confirming that AR-CDT captures transferable biological signals rather than dataset-specific artifacts. Furthermore, by combining dimensionality reduction with SHAP-based interpretation of our One-vs-Rest (OvR) classifiers, AR-CDT disentangles disease-specific pathogenic signatures from the shared dysbiotic background among clinically distinct yet microbially similar diseases.
{"title":"AR-CDT NET: a deep deformable convolutional network for gut microbiome-based disease classification.","authors":"Jiaye Li, Zijian Sun, Shuo Chai, Hangming Li, Yijun Wang, Jingkui Tian","doi":"10.1186/s12859-025-06357-0","DOIUrl":"10.1186/s12859-025-06357-0","url":null,"abstract":"<p><p>Advances in metagenomic sequencing have increasingly implicated gut microbiome dysbiosis in numerous complex diseases, yet its application for precise differential diagnosis remains a major challenge. Existing computational approaches often show limited predictive performance and insufficient robustness when applied to large-scale, imbalanced microbiome datasets, and they typically lack mechanisms to effectively capture microbial community-level or functional guild interactions. To address these limitations, we developed AR-CDT Net, a novel deep learning framework that integrates a Multi-Scale Deformable Convolution (MS-DConv) module with a Channel-wise Dynamic Tanh (CD-Tanh) activation function to achieve more accurate and robust classification of host disease states. Evaluated on a large-scale cohort comprising over 8000 samples spanning eight disease phenotypes, AR-CDT Net demonstrated highly competitive within-cohort performance, outperforming nine representative models across the majority of classification tasks. Importantly, in a stringent cross-dataset generalization test, the model was trained on the highly imbalanced primary multi-disease cohort and validated on relatively balanced independent external cohorts. It achieved a statistically significant AUC of 0.7921 on the highly heterogeneous external T2D cohort, confirming that AR-CDT captures transferable biological signals rather than dataset-specific artifacts. 
Furthermore, by combining dimensionality reduction with SHAP-based interpretation of our One-vs-Rest (OvR) classifiers, AR-CDT disentangles disease-specific pathogenic signatures from the shared dysbiotic background among clinically distinct yet microbially similar diseases.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":"23"},"PeriodicalIF":3.3,"publicationDate":"2025-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12849458/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145843427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-24 DOI: 10.1186/s12859-025-06349-0
He Li, Zander Gu, Said El Bouhaddani, Jeanine Houwing-Duistermaat
Background: In studies that aim to model the relationship between an outcome variable and multiple omics datasets, it is often desirable to reduce the dimensionality of these datasets or to represent one omics dataset in terms of another. Several approaches exist for this purpose, including univariate methods such as polygenic scores, and multivariate methods. Multivariate approaches offer advantages by producing lower-dimensional integrative scores, capturing joint structures across datasets, and filtering out dataset-specific noise. In this paper, we describe one univariate and two multivariate methods, and evaluate their performance through simulations involving two correlated multivariate normally distributed omics datasets, as well as a combination of one multivariate normal and one fixed categorical dataset.
Results: We assess method performance using the root mean squared error (RMSE) when modelling the outcome variable as a function of the reduced omics representations. Multivariate methods generally perform well, particularly when a slightly higher number of components is used for integration. They outperform the univariate method in scenarios involving two normally distributed omics datasets and perform comparably in settings with one normal and one categorical dataset. In real data applications, including two metabolomics datasets from TwinsUK and a metabolomics-genetic dataset from ORCADES, all methods show similar performance in modelling body mass index.
Conclusions: Multivariate methods provide a valuable framework for summarizing multi-omics datasets into low-dimensional components suitable for outcome modelling. Even in the presence of non-normal data, these methods offer a promising alternative to high-dimensional univariate approaches.
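The evaluation described in this abstract (reduce an omics matrix to a few components, model the outcome on those components, score by RMSE) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the abstract does not name its multivariate methods, so plain PCA is swapped in as a stand-in, and the function names `rmse`, `fit_pca`, and `outcome_rmse` are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def fit_pca(X, n_components):
    """Return the column means and the leading PCA loadings of X.

    PCA is an assumption here, used only as a stand-in for the
    multivariate dimension-reduction methods mentioned in the abstract.
    """
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T  # loadings, shape (features, components)


def outcome_rmse(X_train, y_train, X_test, y_test, n_components):
    """Regress the outcome on low-dimensional scores; report test-set RMSE."""
    mean, loadings = fit_pca(X_train, n_components)
    # Project both sets with the loadings learned on the training set only.
    T_train = (X_train - mean) @ loadings
    T_test = (X_test - mean) @ loadings
    # Ordinary least squares with an intercept column.
    A_train = np.column_stack([np.ones(len(T_train)), T_train])
    A_test = np.column_stack([np.ones(len(T_test)), T_test])
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return rmse(y_test, A_test @ beta)
```

Calling `outcome_rmse` over a range of `n_components` values gives one RMSE per setting, which is one way to examine how the number of components affects outcome modelling.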
{"title":"Statistical modelling of an outcome variable with integrated multi-omics.","authors":"He Li, Zander Gu, Said El Bouhaddani, Jeanine Houwing-Duistermaat","doi":"10.1186/s12859-025-06349-0","DOIUrl":"10.1186/s12859-025-06349-0","url":null,"abstract":"<p><strong>Background: </strong>In studies that aim to model the relationship between an outcome variable and multiple omics datasets, it is often desirable to reduce the dimensionality of these datasets or to represent one omics dataset in terms of another. Several approaches exist for this purpose, including univariate methods such as polygenic scores, and multivariate methods. Multivariate approaches offer advantages by producing lower-dimensional integrative scores, capturing joint structures across datasets, and filtering out dataset-specific noise. In this paper, we describe one univariate and two multivariate methods, and evaluate their performance through simulations involving two correlated multivariate normally distributed omics datasets, as well as a combination of one multivariate normal and one fixed categorical dataset.</p><p><strong>Results: </strong>We assess method performance using the root mean squared error (RMSE) when modelling the outcome variable as a function of the reduced omics representations. Multivariate methods generally perform well, particularly when a slightly higher number of components is used for integration. They outperform the univariate method in scenarios involving two normally distributed omics datasets and perform comparably in settings with one normal and one categorical dataset. In real data applications, including two metabolomics datasets from TwinsUK and a metabolomics-genetic dataset from ORCADES, all methods show similar performance in modelling body mass index.</p><p><strong>Conclusions: </strong>Multivariate methods provide a valuable framework for summarizing multi-omics datasets into low-dimensional components suitable for outcome modelling. 
Even in the presence of non-normal data, these methods offer a promising alternative to high-dimensional univariate approaches.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":"26"},"PeriodicalIF":3.3,"publicationDate":"2025-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12859906/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145826816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The rapid development of single-cell sequencing technologies has provided robust technical support for efficiently resolving multiple levels of molecular information from a single cell population. However, the data produced by these technologies are often noisy and differ in their characteristics, which makes single-cell multi-omics data difficult to integrate and analyze. There is therefore a growing demand for methods that integrate single-cell multi-omics data: jointly analyzing these data is expected to improve the ability to reveal cellular heterogeneity and to provide new biological perspectives for a deeper understanding of cellular phenotypes. We propose LONMF, a non-negative matrix factorization algorithm that combines graph Laplacian and optimal transmission to enhance clustering performance and interpretability. We apply LONMF to visualize and cluster multi-pair single-cell multi-omics data, including 10X-multi-group, CITE-seq, and TEA-multi-group seq, to facilitate marker characterization and gene ontology enrichment analysis and to provide rich biological insights for downstream analyses. Our comprehensive benchmarking demonstrates that LONMF performs comparably to current state-of-the-art methods in cell clustering and outperforms other methods in terms of biological interpretability.
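To make the idea of a graph-Laplacian-regularized non-negative matrix factorization concrete, the sketch below implements the generic graph-regularized NMF with multiplicative updates. It is not the LONMF algorithm: the abstract gives no update rules, the optimal-transmission component is omitted entirely, and the names `knn_affinity`, `graph_nmf`, and `objective`, the k-nearest-neighbour graph, and all parameter values are assumptions made for illustration. The objective minimized is ||X - U Vᵀ||² + λ·Tr(Vᵀ L V), where L = D - W is the Laplacian of a cell-cell affinity graph W.

```python
import numpy as np


def knn_affinity(X, k=5):
    """Symmetric binary k-nearest-neighbour affinity between columns (cells) of X."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)  # squared distances
    np.fill_diagonal(d2, np.inf)  # a cell is never its own neighbour
    nearest = np.argsort(d2, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(W, W.T)  # symmetrize


def graph_nmf(X, W, rank, lam=1.0, n_iter=200, eps=1e-10, seed=0):
    """Factor non-negative X (features x cells) as U @ V.T with a graph penalty on V."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, rank))
    V = rng.random((n, rank))
    D = np.diag(W.sum(axis=1))  # degree matrix of the affinity graph
    for _ in range(n_iter):
        # Multiplicative updates keep U and V non-negative.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
    return U, V


def objective(X, U, V, W, lam):
    """Reconstruction error plus the graph-Laplacian smoothness penalty."""
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.norm(X - U @ V.T) ** 2 + lam * np.trace(V.T @ L @ V)
```

In this sketch, each row of `V` is a low-dimensional embedding of one cell, so `np.argmax(V, axis=1)` gives a simple cluster assignment, and the columns of `U` can be inspected as feature loadings per factor.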
{"title":"LONMF: a non-negative matrix factorization model based on graph Laplacian and optimal transmission for paired single-cell multi-omics data integration.","authors":"Mengdi Nan, Qing Ren, Yuhan Fu, Xiang Chen, Guanpeng Qi, Liugen Wang, Jie Gao","doi":"10.1186/s12859-025-06301-2","DOIUrl":"10.1186/s12859-025-06301-2","url":null,"abstract":"<p><p>The rapid development of single-cell sequencing technologies has provided a robust technical support for the efficient resolution of multiple levels of molecular information from a single-cell population. However, the data produced by these technologies often contain a lot of noise and differences in characteristics that make it difficult to integrate and analyze single-cell multi-omics data. In this study, there is a growing demand for methods to integrate single-cell multi-omics data, which is expected to enhance the ability to reveal cellular heterogeneity and provide new biological perspectives for a deeper understanding of cellular phenotypes by jointly analyzing multi-omics data. We propose LONMF, a non-negative matrix factorization algorithm combining graph Laplacian and optimal transmission to enhance clustering performance and interpretability. We apply LONMF to visualize and cluster multi-pair single-cell multi-omics data, including 10X-multi-group, CITE-seq, and TEA-multi-group seq, to facilitate marker characterization and gene ontology enrichment analysis and to provide rich biological insights for downstream analyses. 
Our comprehensive benchmarking demonstrates that LONMF exhibits comparable performance compared with the current state-of-the-art in cell clustering and outperforms other methods in terms of biological interpretability.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"294"},"PeriodicalIF":3.3,"publicationDate":"2025-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12729160/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145817499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}