Pub Date: 2026-01-03 DOI: 10.1186/s12859-025-06359-y
Sicheng He, Cheng Chen, Xianrun Pan, Gaogao Xue, Yu Yang, Juan Feng, Hasan Zulfiqar, Yang Zhang, Kejun Deng
Background: Small interfering RNA (siRNA) is a powerful tool for gene silencing, but its clinical application is limited by instability and potential immunogenicity. While chemical modification is essential to overcome these hurdles, data on chemically modified siRNAs are currently scattered, hindering rational drug design and development.
Results: We developed CMsiRNAdb, a comprehensive database integrating data resources, analytical tools, and efficacy prediction for chemically modified siRNAs. We consolidated 43,153 experimentally validated sequences and silencing efficiency data derived from 90 patents, covering 36 modification types and 13 therapeutic target genes. The database offers multi-dimensional retrieval, visualization, and batch download functions. Furthermore, we developed ModMapper, a trie-based tool for precise identification of modification sites, and integrated the Cm-siRPred model for efficacy evaluation. CMsiRNAdb is freely accessible at https://cellknowledge.com.cn/CMsiRNAdb/.
Conclusion: CMsiRNAdb provides critical data support and analytical tools for the rational design and rapid optimization of siRNA drugs. By standardizing data and offering predictive capabilities, it significantly advances the development of nucleic acid therapeutics.
Title: CMsiRNAdb: a database of chemically modified siRNA silencing efficiency for nucleic acid drug design.
Journal: BMC Bioinformatics, p. 33.
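The abstract describes ModMapper only as a trie-based tool for locating modification sites. The following is a minimal sketch of that general idea, not the authors' implementation. The notation is an assumption made for illustration: plain `A/C/G/U` are unmodified, an `m` prefix marks 2'-O-methyl, and an `f` prefix marks 2'-fluoro. The names `build_trie` and `find_modifications` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.label = None  # set when a complete token ends at this node


def build_trie(tokens):
    """Insert every token (symbol -> label) into a character trie."""
    root = TrieNode()
    for symbol, label in tokens.items():
        node = root
        for ch in symbol:
            node = node.children.setdefault(ch, TrieNode())
        node.label = label
    return root


def find_modifications(sequence, root):
    """Scan left to right, taking the longest token that matches at each offset.

    Returns (nucleotide_position, symbol, label) for every modified residue.
    Positions are 1-based and count nucleotides, not characters.
    """
    hits = []
    i = 0
    position = 0
    while i < len(sequence):
        node = root
        best_end, best_label = None, None
        j = i
        while j < len(sequence) and sequence[j] in node.children:
            node = node.children[sequence[j]]
            j += 1
            if node.label is not None:
                best_end, best_label = j, node.label
        if best_end is None:
            raise ValueError(f"unrecognised symbol at offset {i}: {sequence[i]!r}")
        position += 1
        if best_label != "unmodified":
            hits.append((position, sequence[i:best_end], best_label))
        i = best_end
    return hits


# Hypothetical notation (an assumption, not the database's actual encoding).
TOKENS = {base: "unmodified" for base in "ACGU"}
TOKENS.update({"m" + b: "2'-O-methyl" for b in "ACGU"})
TOKENS.update({"f" + b: "2'-fluoro" for b in "ACGU"})

TRIE = build_trie(TOKENS)
hits = find_modifications("mAfUGCmG", TRIE)
# hits == [(1, "mA", "2'-O-methyl"), (2, "fU", "2'-fluoro"), (5, "mG", "2'-O-methyl")]
```

A trie suits this task because modification symbols of different lengths can share prefixes, and longest-match scanning resolves them in a single pass over the sequence.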
Pub Date: 2025-12-30 DOI: 10.1186/s12859-025-06354-3
Lucas F Jansen Klomp, Xinqi Yan, Rebecca R Snabel, Gert Jan C Veenstra, Hil G E Meijer, Janine N Post
Title: DANSE: a pipeline for dynamic modelling of time-series multi-omics data.
Journal: BMC Bioinformatics, p. 28. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12859988/pdf/
Pub Date: 2025-12-30 DOI: 10.1186/s12859-025-06355-2
Tristan Cumer, Sotiria Milia, Alexander S Leonard, Hubert Pausch
Background: Pangenome graphs integrate multiple assemblies to represent non-redundant genetic diversity. However, current evaluations of pangenome graphs rely primarily on technical parameters (e.g., total length, number of nodes/edges, growth curves), which fail to assess how effectively the graph represents homologous stretches across the integrated assemblies and how well short reads align against pangenome graph references.
Results: We introduce a novel method to quantitatively assess how well a pangenome graph represents its integrated assemblies. Our method quantifies how many single-copy and universal k-mers from the source assemblies are uniquely and completely represented within the graph nodes. Implemented in the open-source tool PG-SCUnK, this approach identifies the fractions of unique, duplicated, and split k-mers, which correlate with short read mapping rates to the pangenome graph.
Conclusions: Insights provided by PG-SCUnK facilitate the selection of appropriate parameters to build optimal reference pangenome graphs.
Title: PG-SCUnK: measuring pangenome graph representativeness using single-copy and universal k-mers.
Journal: BMC Bioinformatics, p. 29. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12859900/pdf/
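The k-mer bookkeeping described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the PG-SCUnK code: sequences are plain strings, reverse complements are ignored, and a k-mer counts as "split" when it does not occur in full inside any single node. The function names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (nothing if seq is shorter than k)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def single_copy_universal_kmers(assemblies, k):
    """k-mers that occur exactly once in every assembly."""
    shared = None
    for seq in assemblies:
        counts = Counter(kmers(seq, k))
        once = {km for km, c in counts.items() if c == 1}
        shared = once if shared is None else shared & once
    return shared or set()


def classify(scunks, node_seqs, k):
    """Count how each single-copy universal k-mer appears in the graph nodes."""
    node_counts = Counter()
    for seq in node_seqs:
        node_counts.update(km for km in kmers(seq, k) if km in scunks)
    result = {"unique": 0, "duplicated": 0, "split": 0}
    for km in scunks:
        c = node_counts[km]
        if c == 1:
            result["unique"] += 1      # fully contained in exactly one node
        elif c > 1:
            result["duplicated"] += 1  # present in several places in the graph
        else:
            result["split"] += 1       # broken across node boundaries
    return result


assemblies = ["ACGTTGCA", "ACGTAGCA"]
scunks = single_copy_universal_kmers(assemblies, k=3)
# Toy graph nodes: "CGT" is cut by a node boundary, "GCA" appears in two nodes.
nodes = ["ACG", "T", "T", "A", "GCA", "GCA"]
summary = classify(scunks, nodes, k=3)
# summary == {"unique": 1, "duplicated": 1, "split": 1}
```

In this toy case the three shared k-mers are `ACG`, `CGT`, and `GCA`, and the graph represents only one of them uniquely and completely.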
Pub Date: 2025-12-30 DOI: 10.1186/s12859-025-06363-2
Xiao Han, Xiaochen Cen, Zhijin Li, Xiaobo Zhou, Zhiwei Ji
Background: The circadian clock is an evolutionarily conserved system that orchestrates 24-h physiological rhythms through transcriptional and translational feedback loops. Mounting evidence suggests a bidirectional relationship between circadian rhythm alteration and disease progression, positioning the circadian clock as a potential therapeutic target. Due to the scarcity of high-resolution temporal omics data, it remains very challenging to elucidate the underlying regulatory mechanisms of the circadian system. As a practical alternative, public untimed transcriptomic datasets offer the potential to infer gene expression oscillations retrospectively. However, existing computational approaches for circadian phase estimation often suffer from limited predictive accuracy, reducing their ability to reliably reconstruct rhythmic gene expression patterns.
Results: To overcome these limitations, we develop DCPR, an unsupervised deep learning framework designed to accurately reconstruct the circadian phase from untimed transcriptomic data. Through comprehensive analyses of both simulated and real data, DCPR consistently outperforms existing methods in circadian phase estimation. Additional validations using knowledgebase mining and ex vivo experimental data further support DCPR's efficacy in reconstructing the oscillatory pattern of gene expression and detecting circadian variation.
Conclusions: Our study demonstrates that DCPR is a highly versatile tool for systematically identifying transcriptional rhythms from untimed expression data. This tool will facilitate therapeutic discovery for circadian-related behavioral and pathological disorders.
Title: DCPR: a deep learning framework for circadian phase reconstruction.
Journal: BMC Bioinformatics, p. 31. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12866578/pdf/
Pub Date: 2025-12-29 DOI: 10.1186/s12859-025-06321-y
Shen Xiao, Yuhang Li, Jinwei Bai, Zhenhua Shen, Can Huang, Rongwu Xiang, Yuxuan Zhai, Xiwei Jiang
Background: Drug combination therapy is a promising approach for treating complex diseases because it can reduce toxicity and enhance therapeutic efficacy. However, accurately identifying drug combination effects remains challenging.
Results: In this work, we propose a novel directed weighted network-based approach to identify drug combinations. Specifically, the network is constructed from both drug-target and inter-target interactions, together with their directed regulation. The propagation and attenuation of drug effects are modeled to capture direct and indirect drug actions on targets. By assigning regulatory-effect weights to nodes, relative distances between node sets within the network can be computed. These distances are then analyzed to discriminate the combinatorial efficacy of various drug combinations. Empirical evaluations show that the proposed method performs well and compares favorably with existing approaches on the task of drug combination prediction.
Conclusion: The proposed method offers a practical scheme for identifying drug combination effects. By analyzing drug-target and inter-target regulatory relations, it distinguishes combinatorial efficacy more effectively and mitigates the deficiencies of classical drug combination prediction models.
Title: A directed weighted network-based method for drug combinations identification using drug-target and inter-target regulation.
Journal: BMC Bioinformatics 26(1):299. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12751882/pdf/
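The abstract does not give the propagation or distance formulas, so the sketch below only illustrates the general idea of signed, attenuated propagation on a directed network. Everything specific here is an assumption: a unit effect on each direct target, a fixed per-hop `decay`, first-arrival (shortest-path) weights, and an L1 distance between effect profiles. It should not be read as the authors' model.

```python
from collections import deque


def propagate(edges, targets, decay=0.5, max_depth=3):
    """Spread a unit effect from each direct target along directed edges.

    edges maps a node to a list of (neighbor, sign) pairs, where sign is +1
    for activation and -1 for inhibition. Each node keeps the effect of its
    first (shortest-path) arrival, attenuated by `decay` per hop.
    """
    effect = {t: 1.0 for t in targets}
    queue = deque((t, 0) for t in targets)
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for neighbor, sign in edges.get(node, []):
            if neighbor not in effect:
                effect[neighbor] = effect[node] * sign * decay
                queue.append((neighbor, depth + 1))
    return effect


def profile_distance(a, b):
    """L1 distance between two effect profiles (missing nodes count as 0)."""
    nodes = set(a) | set(b)
    return sum(abs(a.get(n, 0.0) - b.get(n, 0.0)) for n in nodes)


# Toy network with hypothetical node names.
edges = {
    "T1": [("G1", +1)],
    "G1": [("G2", -1)],
    "T2": [("G2", +1)],
}
drug_a = propagate(edges, targets=["T1"])  # {"T1": 1.0, "G1": 0.5, "G2": -0.25}
drug_b = propagate(edges, targets=["T2"])  # {"T2": 1.0, "G2": 0.5}
distance = profile_distance(drug_a, drug_b)  # 3.25
```

Here the two drugs push `G2` in opposite directions, which contributes 0.75 of the total distance; a scheme of this kind can separate complementary from redundant drug pairs.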
Pub Date: 2025-12-29 DOI: 10.1186/s12859-025-06325-8
Alisson Silva, Carlos Marquez, Iury Godoy, Lucas Silva, Matheus Prado, Murilo Beppler, Natanael Avila, Bruno Travençolo, Anderson R Santos
Background: Computational prediction of protein-protein interactions (PPIs) is crucial for understanding cell biology and drug development, offering an alternative to costly experimental methods. The original GenPPi software advanced ab initio PPI network prediction from bacterial genomes but was limited by its reliance on high sequence similarity. This work introduces GenPPi 1.5 to enhance these predictive capabilities.
Results: GenPPi 1.5 incorporates a Random Forest (RF) algorithm, trained on 60 biophysical features from amino acid propensity indices, to classify protein similarity even in low sequence identity scenarios (targeting >65% identity). To manage computational complexity from the increased interactions generated by the RF model, especially in extensive conserved phylogenetic profiles, we developed and integrated the Reduced Interaction Sampling (RIS) algorithm. RIS stochastically samples interactions within these profiles, optimizing performance for complete genome analysis. Extensive simulations across various configurations validated the methodology. RF integration significantly broadened GenPPi's predictive power; application to Buchnera aphidicola showed up to 62% overlap with STRING database interactions. Analysis of RIS demonstrated that while introducing some randomness, critical node identification remains robust, particularly for Top_N values ≥ 100, indicating minimal compromise to network integrity.
Conclusion: The combination of Machine Learning (RF) and the RIS algorithm in GenPPi 1.5 represents a significant advancement. It overcomes the high-similarity dependency of the previous version while efficiently handling complex genomes. GenPPi 1.5 provides a robust and scalable alignment-free PPI prediction solution, enabling users to train custom models tailored to specific genomic contexts. GenPPi is freely available on our website https://genppi.facom.ufu.br/ , its source code is hosted on GitHub https://github.com/santosardr/genppi , and it can be easily installed via the Python Package Index using the command pip install genppi-py.
Title: Improving protein interaction prediction in GenPPi: a novel interaction sampling approach preserving network topology.
Journal: BMC Bioinformatics 26(1):296. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12751606/pdf/
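The abstract states only that RIS stochastically samples interactions within large conserved phylogenetic profiles. A minimal sketch of that kind of capped sampling is shown below. It is not the GenPPi implementation: the function name, the `max_edges` cap, and the uniform sampling scheme are assumptions chosen for illustration.

```python
import random
from itertools import combinations


def sample_interactions(profile, max_edges, seed=None):
    """Return at most `max_edges` protein pairs from one phylogenetic profile.

    A profile with n proteins implies n * (n - 1) / 2 candidate interactions.
    Small profiles are kept whole; large ones are sampled uniformly at random
    without replacement.
    """
    rng = random.Random(seed)
    pairs = list(combinations(sorted(profile), 2))
    if len(pairs) <= max_edges:
        return pairs
    return rng.sample(pairs, max_edges)


# Hypothetical protein identifiers.
profile = ["p1", "p2", "p3", "p4", "p5", "p6"]  # 15 candidate pairs
sampled = sample_interactions(profile, max_edges=5, seed=42)
small = sample_interactions(["p1", "p2", "p3"], max_edges=5)
```

Fixing the seed makes a run reproducible, which matters when comparing which nodes rank as critical across repeated samplings. Note that this sketch materialises every pair before sampling, so memory still grows quadratically with profile size.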
Pub Date: 2025-12-29 DOI: 10.1186/s12859-025-06316-9
Matthew Massett, Adrian Carr
Background: Protein Language Models (PLMs) are emerging as powerful tools for designing human proteins, including antibodies. These models can predict the effects of mutations in a zero-shot setting, without requiring additional fine-tuning, and suggest plausible amino acid substitutions.
Results: We introduce Protein Diversification and Generation through Yielded Mutations (Prodigy) Protein, which provides several DirectedEvolution classes that introduce amino acid substitutions in a stepwise manner. Each substitution is evaluated using one of two scoring strategies, and the most promising candidates are sampled accordingly. Users can customize the number of evolution steps, specify target regions within the protein sequence, and set score thresholds to filter out low-quality substitutions during the design process.
Conclusion: Protein Diversification and Generation through Yielded Mutations (Prodigy) Protein is a fast and flexible tool for in silico protein design. It introduces a consistent and efficient probabilistic framework that leverages any masked language modeling Protein Language Model (PLM) available via Hugging Face. Unlike existing tools, Prodigy Protein can integrate multiple PLMs to design protein variants, an approach not currently supported by other publicly available software.
Title: Prodigy protein: Python package for zero-shot protein engineering using protein language models.
Journal: BMC Bioinformatics 26(1):298. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12751917/pdf/
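The stepwise loop described in the abstract (score candidate substitutions, filter by threshold, sample, repeat) can be sketched without any language model. The code below does not use the Prodigy Protein API; `evolve` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for a PLM-derived score, assumed here purely so the example runs on its own.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(sequence, position, residue):
    """Stand-in for a PLM score; favours hydrophobic residues (illustrative only)."""
    return 1.0 if residue in "AILMFVW" else -1.0


def evolve(sequence, steps, region, score_fn, threshold=0.0, seed=None):
    """Apply up to `steps` single substitutions inside `region`.

    region is a half-open, 0-based (start, end) interval. At each step every
    possible substitution is scored, candidates below `threshold` are dropped,
    and one survivor is sampled with probability proportional to exp(score).
    """
    rng = random.Random(seed)
    seq = list(sequence)
    history = []
    start, end = region
    for _ in range(steps):
        candidates = []
        for pos in range(start, end):
            for aa in AMINO_ACIDS:
                if aa == seq[pos]:
                    continue
                score = score_fn("".join(seq), pos, aa)
                if score >= threshold:
                    candidates.append((pos, aa, score))
        if not candidates:
            break  # nothing passes the threshold; stop early
        weights = [math.exp(score) for _, _, score in candidates]
        pos, aa, score = rng.choices(candidates, weights=weights, k=1)[0]
        history.append((pos, seq[pos], aa, score))
        seq[pos] = aa
    return "".join(seq), history


variant, history = evolve("GGGGGG", steps=3, region=(1, 4),
                          score_fn=toy_score, threshold=0.0, seed=0)
```

In a real setting `score_fn` would wrap a masked language model, for example by masking `position` and reading the model's log-probability for `residue`; that wiring is left out because it depends on the chosen model.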
Pub Date: 2025-12-29 DOI: 10.1186/s12859-025-06317-8
Xianyong Zhou, Xindian Wei, Cheng Liu, Wenjun Shen, Ping Xuan, Si Wu, Hau-San Wong
Single-cell RNA sequencing (scRNA-seq) technology has transformed gene expression studies by enabling analysis at the individual cell level, offering unprecedented insights into cellular heterogeneity. A key challenge in scRNA-seq data analysis is cell type identification, which requires grouping cells with similar gene expression profiles using unsupervised clustering methods. However, the high dimensionality, inherent noise, and significant sparsity of scRNA-seq data present substantial obstacles to accurately determining relationships among cell samples. To address these challenges, we propose a novel deep subspace clustering approach for cell type identification that captures a more reliable subspace structure from scRNA-seq data. Our method leverages a robust self-representation learning framework to effectively characterize and learn the underlying cluster structure. This framework is optimized through an integrated strategy combining a structure-guided approach with an optimal transport algorithm, enhancing the robustness of the subspace clustering process. By mitigating the effects of noise and sparsity in scRNA-seq data, this approach enables more accurate cell clustering. Experimental results on 18 real scRNA-seq datasets demonstrate that our method outperforms several state-of-the-art clustering approaches tailored for scRNA-seq data, excelling in both accuracy and interpretability.
Title: Robust subspace structure discovery for cell type identification in scRNA-seq data.
Journal: BMC Bioinformatics 26(1):295. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12752283/pdf/
Pub Date: 2025-12-29 DOI: 10.1186/s12859-025-06318-7
Junrong Song, Yuanli Gong, Zhiming Song, Xinggui Xu, Kun Qian, Yingbo Liu
Background: Cancer's complexity and heterogeneity pose significant challenges for personalized treatment. Accurate classification of patients into molecular subtypes is critical for targeted therapy and improved outcomes. However, existing methods often fail to simultaneously capture inter-patient heterogeneity and shared molecular patterns in driver gene profiles.
Results: To address this limitation, we propose DriverSub-SVM, a novel framework for interpretable cancer subtype classification that integrates patient-specific and cohort-wide driver gene information. Our method first models the bidirectional influence between mutated and dysregulated genes via a random walk on a functional interaction network. It then applies Bayesian Personalized Ranking (BPR) to infer personalized driver gene rankings for each patient. These rankings are aggregated into a consensus driver gene set using the Condorcet method. Subsequently, a One-Against-One Multiclass Support Vector Machine (OAO-MSVM) classifies patients based on their gene-level profiles. Evaluated on multiple TCGA datasets, DriverSub-SVM outperformed four state-of-the-art methods, achieving higher accuracy and identifying clinically relevant genes associated with prognosis and therapeutic response.
Conclusion: DriverSub-SVM offers an effective and interpretable approach for cancer subtype classification by bridging individual heterogeneity and population-level patterns. It enhances understanding of tumor biology and holds promise for precision oncology and biomarker discovery. The source code is available at https://github.com/sjunrong/DriverSub-SVM .
Title: DriverSub-SVM: a machine learning approach for cancer subtype classification by integrating patient-specific and global driver genes.
Journal: BMC Bioinformatics 26(1):297. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12751776/pdf/
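The aggregation step, in which per-patient gene rankings are merged into a consensus, can be illustrated with pairwise majority comparisons. The sketch below is an assumption about how such a step might look, not the DriverSub-SVM code: it expects every patient to rank the same genes, and it orders genes by their number of pairwise wins (a Copeland-style tally, one common way to obtain a full ordering from Condorcet comparisons). The gene names are used only as example labels.

```python
from itertools import combinations


def condorcet_ranking(rankings):
    """Order genes by how many head-to-head majority contests they win.

    rankings is a list of per-patient lists, best gene first. Gene a beats
    gene b when more patients rank a above b than the reverse. Ties in the
    win count are broken alphabetically.
    """
    genes = sorted(set(rankings[0]))
    wins = {g: 0 for g in genes}
    positions = [{g: i for i, g in enumerate(r)} for r in rankings]
    for a, b in combinations(genes, 2):
        a_pref = sum(1 for p in positions if p[a] < p[b])
        b_pref = len(positions) - a_pref
        if a_pref > b_pref:
            wins[a] += 1
        elif b_pref > a_pref:
            wins[b] += 1
    return sorted(genes, key=lambda g: (-wins[g], g))


patient_rankings = [
    ["TP53", "KRAS", "EGFR"],
    ["KRAS", "TP53", "EGFR"],
    ["TP53", "EGFR", "KRAS"],
]
consensus = condorcet_ranking(patient_rankings)
# consensus == ["TP53", "KRAS", "EGFR"]
```

A consensus driver gene set could then be taken as the top entries of this ordering; how many to keep is a separate choice that the abstract does not specify.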