BMC Bioinformatics: latest articles (page 4)

CMsiRNAdb: a database of chemically modified SiRNA silencing efficiency for nucleic acid drug design.

IF 3.3 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2026-01-03 DOI: 10.1186/s12859-025-06359-y

Sicheng He, Cheng Chen, Xianrun Pan, Gaogao Xue, Yu Yang, Juan Feng, Hasan Zulfiqar, Yang Zhang, Kejun Deng

Background: Small interfering RNA (siRNA) is a powerful tool for gene silencing, but its clinical application is limited by instability and potential immunogenicity. While chemical modification is essential to overcome these hurdles, data on chemically modified siRNAs are currently scattered, hindering rational drug design and development.

Results: We developed CMsiRNAdb, a comprehensive database integrating data resources, analytical tools, and efficacy prediction for chemically modified siRNAs. We consolidated 43,153 experimentally validated sequences and silencing efficiency data derived from 90 patents, covering 36 modification types and 13 therapeutic target genes. The database offers multi-dimensional retrieval, visualization, and batch download functions. Furthermore, we developed ModMapper, a trie-based tool for precise identification of modification sites, and integrated the Cm-siRPred model for efficacy evaluation. CMsiRNAdb is freely accessible at https://cellknowledge.com.cn/CMsiRNAdb/ .

Conclusion: CMsiRNAdb provides critical data support and analytical tools for the rational design and rapid optimization of siRNA drugs. By standardizing data and offering predictive capabilities, it significantly advances the development of nucleic acid therapeutics.
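The abstract describes ModMapper only as a trie-based tool for locating modification sites. The following is a minimal sketch of how a trie can do greedy longest-match tokenization of a modified sequence; it is not ModMapper's implementation. The notation (`m` prefix for 2'-O-methyl, `f` prefix for 2'-fluoro) and the names `TOKENS`, `build_trie`, and `find_modifications` are assumptions made for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.label = None  # set when a complete token ends at this node


def build_trie(tokens):
    """Build a trie from a {token: modification label} mapping."""
    root = TrieNode()
    for token, label in tokens.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.label = label
    return root


def find_modifications(sequence, root):
    """Greedy longest-match scan.

    Returns a list of (nucleotide position, token, label), positions 1-based.
    """
    hits = []
    i = 0
    position = 0
    while i < len(sequence):
        node = root
        j = i
        best = None
        # Walk the trie as far as the input allows, remembering the
        # longest complete token seen so far.
        while j < len(sequence) and sequence[j] in node.children:
            node = node.children[sequence[j]]
            j += 1
            if node.label is not None:
                best = (j, node.label)
        if best is None:
            raise ValueError(f"unrecognised symbol at offset {i}: {sequence[i]!r}")
        end, label = best
        position += 1
        hits.append((position, sequence[i:end], label))
        i = end
    return hits


# Hypothetical notation, chosen only for this sketch.
BASES = "ACGU"
TOKENS = {b: "unmodified" for b in BASES}
TOKENS.update({"m" + b: "2'-O-methyl" for b in BASES})
TOKENS.update({"f" + b: "2'-fluoro" for b in BASES})

if __name__ == "__main__":
    for hit in find_modifications("mAfUGC", build_trie(TOKENS)):
        print(hit)
```

A trie suits this task because modification tokens of different lengths can share prefixes, and a single left-to-right pass resolves them without backtracking over the whole token table.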

Citations: 0

PLysPTM-HGNN: predicting lysine PTM sites of proteins using hybrid graph neural networks.

IF 3.3 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2026-01-02 DOI: 10.1186/s12859-025-06356-1

Lei Chen, Jingyu Yang, Bo Zhou, Yu-Dong Cai

Citations: 0

DANSE: a pipeline for dynamic modelling of time-series multi-omics data.

IF 3.3 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2025-12-30 DOI: 10.1186/s12859-025-06354-3

Lucas F Jansen Klomp, Xinqi Yan, Rebecca R Snabel, Gert Jan C Veenstra, Hil G E Meijer, Janine N Post

Citations: 0

PG-SCUnK: measuring pangenome graph representativeness using single-copy and universal K-mers.

IF 3.3 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2025-12-30 DOI: 10.1186/s12859-025-06355-2

Tristan Cumer, Sotiria Milia, Alexander S Leonard, Hubert Pausch

Background: Pangenome graphs integrate multiple assemblies to represent non-redundant genetic diversity. However, current evaluations of pangenome graphs rely primarily on technical parameters (e.g., total length, number of nodes/edges, growth curves), which fail to assess how effectively the graph represents homologous stretches across the integrated assemblies and how well short reads align against pangenome graph references.

Results: We introduce a novel method to quantitatively assess how well a pangenome graph represents its integrated assemblies. Our method quantifies how many single-copy and universal k-mers from the source assemblies are uniquely and completely represented within the graph nodes. Implemented in the open-source tool PG-SCUnK, this approach identifies the fractions of unique, duplicated, and split k-mers, which correlate with short read mapping rates to the pangenome graph.

Conclusions: Insights provided by PG-SCUnK facilitate the selection of appropriate parameters to build optimal reference pangenome graphs.
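The idea in the Results paragraph can be illustrated with a small sketch, written under simplifying assumptions and not taken from PG-SCUnK itself: assemblies and graph nodes are plain strings, reverse complements are ignored, and a k-mer that is not contained whole in any node is counted as "split". The function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer in a sequence (empty if the sequence is shorter than k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def single_copy_universal_kmers(assemblies, k):
    """k-mers that occur exactly once in every assembly."""
    per_assembly = [kmer_counts(s, k) for s in assemblies]
    candidates = set(per_assembly[0])
    return {km for km in candidates if all(c[km] == 1 for c in per_assembly)}


def classify(kmers, nodes, k):
    """Tally k-mers found once in the nodes, more than once, or in no node."""
    node_counts = Counter()
    for node in nodes:
        node_counts.update(kmer_counts(node, k))
    result = {"unique": 0, "duplicated": 0, "split": 0}
    for km in kmers:
        n = node_counts[km]
        if n == 1:
            result["unique"] += 1
        elif n > 1:
            result["duplicated"] += 1
        else:
            result["split"] += 1  # simplification: absent from every node
    return result


if __name__ == "__main__":
    assemblies = ["ACGTTGCA", "GACGTTGC"]
    nodes = ["ACGTT", "TGCA", "CGTTA"]
    scu = single_copy_universal_kmers(assemblies, 4)
    print(sorted(scu))
    print(classify(scu, nodes, 4))
```

In this toy input, four 4-mers are single-copy and universal; one lands in exactly one node, one is present in two nodes, and two are not contained whole in any node.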

Citations: 0

DCPR: a deep learning framework for circadian phase reconstruction.

IF 3.3 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2025-12-30 DOI: 10.1186/s12859-025-06363-2

Xiao Han, Xiaochen Cen, Zhijin Li, Xiaobo Zhou, Zhiwei Ji

Background: The circadian clock is an evolutionarily conserved system that orchestrates 24-h physiological rhythms through transcriptional and translational feedback loops. Mounting evidence suggests a bidirectional relationship between circadian rhythm alteration and disease progression, positioning the circadian clock as a potential therapeutic target. Due to the scarcity of high-resolution temporal omics data, it remains very challenging to elucidate the underlying regulatory mechanisms of the circadian system. As a practical alternative, public untimed transcriptomic datasets offer the potential to infer gene expression oscillations retrospectively. However, existing computational approaches for circadian phase estimation often suffer from limited predictive accuracy, reducing their ability to reliably reconstruct rhythmic gene expression patterns.

Results: To overcome these limitations, we develop DCPR, an unsupervised deep learning framework designed to accurately reconstruct the circadian phase from untimed transcriptomic data. Through comprehensive analyses of both simulated and real data, DCPR consistently outperforms existing methods in circadian phase estimation. Additional validations using knowledgebase mining and ex vivo experimental data further support DCPR's efficacy in reconstructing the oscillatory pattern of gene expression and detecting circadian variation.

Conclusions: Our study demonstrates that DCPR is a highly versatile tool for systematically identifying transcriptional rhythms from untimed expression data. This tool will facilitate therapeutics discovery for circadian-related behavioral and pathological disorders.

Citations: 0

A directed weighted network-based method for drug combinations identification using drug-target and inter-target regulation.

IF 3.3 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2025-12-29 DOI: 10.1186/s12859-025-06321-y

Shen Xiao, Yuhang Li, Jinwei Bai, Zhenhua Shen, Can Huang, Rongwu Xiang, Yuxuan Zhai, Xiwei Jiang

Background: Drug combination is currently a promising solution for treating complex diseases because it can reduce toxicity and enhance therapeutic efficacy. However, the accurate identification of drug combination effects remains challenging.

Results: In this work, we propose a novel directed weighted network-based approach to identify drug combinations. Specifically, the network is constructed from both drug-target and inter-target interactions, together with their directed regulation. The biological processes of drug effect propagation and attenuation are modeled, aiming to capture direct and indirect drug actions on targets. By assigning weights to nodes of regulatory effects, relative distances between node sets within the network can be computed. These distances are then analyzed to discriminate the combinatorial efficacy of various drug combinations. Empirical evaluations validate the remarkable performance of the proposed method. Compared to existing approaches, our method is a better alternative for the task of drug combination prediction.

Conclusion: The proposed method offers a creative and practical scheme for identifying drug combination effects. By analyzing drug-target and inter-target regulatory relations, our method is more competitive in distinguishing combinatorial efficacy, which mitigates the deficiencies of classical drug combination prediction models.
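The abstract does not give the propagation or distance formulas, so the following is only an illustrative sketch of the general idea: a drug's effect starts at its targets, spreads along directed weighted edges, and shrinks by a fixed factor per hop. The `decay` factor, the hop limit, the signed edge weights, and the Euclidean distance between effect profiles are all assumptions, not the authors' definitions.

```python
import math


def propagate(edges, targets, decay=0.5, max_hops=3):
    """Spread a unit effect from drug targets along directed weighted edges.

    edges: {node: [(successor, weight), ...]}; weight sign encodes
    activation (+) or inhibition (-). Each hop multiplies by `decay`.
    """
    effect = {t: 1.0 for t in targets}
    frontier = dict(effect)
    for _ in range(max_hops):
        nxt = {}
        for node, value in frontier.items():
            for succ, weight in edges.get(node, []):
                nxt[succ] = nxt.get(succ, 0.0) + value * weight * decay
        for node, value in nxt.items():
            effect[node] = effect.get(node, 0.0) + value
        frontier = nxt
    return effect


def profile_distance(effect_a, effect_b):
    """Euclidean distance between two effect profiles over all touched nodes."""
    nodes = set(effect_a) | set(effect_b)
    return math.sqrt(sum((effect_a.get(n, 0.0) - effect_b.get(n, 0.0)) ** 2
                         for n in nodes))


if __name__ == "__main__":
    # Toy network with hypothetical node names.
    edges = {"T1": [("G1", 1.0)], "G1": [("G2", -1.0)], "T2": [("G2", 1.0)]}
    drug_a = propagate(edges, ["T1"])
    drug_b = propagate(edges, ["T2"])
    print(drug_a, drug_b, profile_distance(drug_a, drug_b))
```

The hop limit keeps the loop finite even when the regulatory network contains cycles, at the cost of ignoring very long indirect paths.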

Citations: 0

Improving protein interaction prediction in GenPPi: a novel interaction sampling approach preserving network topology.

IF 3.3 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2025-12-29 DOI: 10.1186/s12859-025-06325-8

Alisson Silva, Carlos Marquez, Iury Godoy, Lucas Silva, Matheus Prado, Murilo Beppler, Natanael Avila, Bruno Travençolo, Anderson R Santos

Background: Computational prediction of protein-protein interactions (PPIs) is crucial for understanding cell biology and drug development, offering an alternative to costly experimental methods. The original GenPPi software advanced ab initio PPI network prediction from bacterial genomes but was limited by its reliance on high sequence similarity. This work introduces GenPPi 1.5 to enhance these predictive capabilities.

Results: GenPPi 1.5 incorporates a Random Forest (RF) algorithm, trained on 60 biophysical features from amino acid propensity indices, to classify protein similarity even in low sequence identity scenarios (targeting >65% identity). To manage computational complexity from the increased interactions generated by the RF model, especially in extensive conserved phylogenetic profiles, we developed and integrated the Reduced Interaction Sampling (RIS) algorithm. RIS stochastically samples interactions within these profiles, optimizing performance for complete genome analysis. Extensive simulations across various configurations validated the methodology. RF integration significantly broadened GenPPi's predictive power; application to Buchnera aphidicola showed up to 62% overlap with STRING database interactions. Analysis of RIS demonstrated that while introducing some randomness, critical node identification remains robust, particularly for Top_N values ≥ 100, indicating minimal compromise to network integrity.

Conclusion: The combination of Machine Learning (RF) and the RIS algorithm in GenPPi 1.5 represents a significant advancement. It overcomes the high-similarity dependency of the previous version while efficiently handling complex genomes. GenPPi 1.5 provides a robust and scalable alignment-free PPI prediction solution, enabling users to train custom models tailored to specific genomic contexts. GenPPi is freely available on our website https://genppi.facom.ufu.br/ , its source code is hosted on GitHub https://github.com/santosardr/genppi , and it can be easily installed via the Python Package Index using the command pip install genppi-py.
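To make the sampling idea concrete: a conserved phylogenetic profile containing n proteins implies n(n-1)/2 candidate pairwise interactions, which grows quickly. The sketch below caps that number by drawing a random subset of pairs. It is not the RIS algorithm from GenPPi 1.5; the function name, the `max_edges` cap, and the uniform sampling scheme are assumptions for illustration.

```python
import itertools
import random


def sample_interactions(proteins, max_edges, seed=0):
    """Return at most `max_edges` protein pairs from one conserved profile.

    All pairs are kept when the profile is small; otherwise a uniform
    random subset is drawn (seeded so that runs are reproducible).
    """
    pairs = list(itertools.combinations(sorted(proteins), 2))
    if len(pairs) <= max_edges:
        return pairs
    return random.Random(seed).sample(pairs, max_edges)


if __name__ == "__main__":
    profile = [f"prot{i}" for i in range(10)]  # 45 possible pairs
    edges = sample_interactions(profile, max_edges=12)
    print(len(edges), edges[:3])
```

Enumerating every pair before sampling is acceptable for a sketch; for very large profiles one would sample index pairs directly instead of materializing the full list.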

Citations: 0

Prodigy protein: Python package for zero-shot protein engineering using protein language models.

IF 3.3 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2025-12-29 DOI: 10.1186/s12859-025-06316-9

Matthew Massett, Adrian Carr

Background: Protein Language Models (PLMs) are emerging as powerful tools for designing human proteins, including antibodies. These models can predict the effects of mutations in a zero-shot setting (without requiring additional fine-tuning) and suggest plausible amino acid substitutions.

Results: We introduce Protein Diversification and Generation through Yielded Mutations (Prodigy) Protein, which provides several DirectedEvolution classes that introduce amino acid substitutions in a stepwise manner. Each substitution is evaluated using one of two scoring strategies, and the most promising candidates are sampled accordingly. Users can customize the number of evolution steps, specify target regions within the protein sequence, and set score thresholds to filter out low-quality substitutions during the design process.

Conclusion: Protein Diversification and Generation through Yielded Mutations (Prodigy) Protein is a fast and flexible tool for in silico protein design. It introduces a consistent and efficient probabilistic framework that leverages any masked language modeling Protein Language Model (PLM) available via Hugging Face. Unlike existing tools, Prodigy Protein can integrate multiple PLMs to design protein variants, an approach not currently supported by other publicly available software.
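The stepwise procedure described in the Results paragraph (score candidate substitutions, filter by threshold, sample one, repeat) can be sketched as below. This does not use or reproduce the Prodigy Protein API: `evolve`, its parameters, and `toy_score` are hypothetical, and the toy scorer stands in for a protein language model so that the example runs without any model download.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def evolve(sequence, score_fn, steps, region, threshold, seed=0):
    """Introduce up to `steps` substitutions, one per step.

    score_fn(sequence, position, amino_acid) -> float (higher is better).
    region is a (start, end) half-open range of 0-based positions.
    Substitutions scoring below `threshold` are discarded; one of the
    rest is sampled with probability proportional to exp(score).
    """
    rng = random.Random(seed)
    seq = list(sequence)
    for _ in range(steps):
        candidates = []
        for pos in range(*region):
            for aa in AMINO_ACIDS:
                if aa == seq[pos]:
                    continue
                score = score_fn("".join(seq), pos, aa)
                if score >= threshold:
                    candidates.append((pos, aa, score))
        if not candidates:
            break  # nothing passes the threshold; stop early
        weights = [math.exp(s) for _, _, s in candidates]
        pos, aa, _ = rng.choices(candidates, weights=weights, k=1)[0]
        seq[pos] = aa
    return "".join(seq)


def toy_score(sequence, position, amino_acid):
    """Stand-in scorer for illustration: favours tryptophan everywhere."""
    return 1.0 if amino_acid == "W" else -1.0


if __name__ == "__main__":
    print(evolve("MKTAYIAK", toy_score, steps=3, region=(2, 5), threshold=0.0))
```

In practice the scorer would wrap a masked language model's per-position probabilities; keeping it as a plain callable makes it straightforward to swap models or average several of them.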

Citations: 0

Robust subspace structure discovery for cell type identification in scRNA-seq data.

IF 3.3 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2025-12-29 DOI: 10.1186/s12859-025-06317-8

Xianyong Zhou, Xindian Wei, Cheng Liu, Wenjun Shen, Ping Xuan, Si Wu, Hau-San Wong

Single-cell RNA sequencing (scRNA-seq) technology has transformed gene expression studies by enabling analysis at the individual cell level, offering unprecedented insights into cellular heterogeneity. A key challenge in scRNA-seq data analysis is cell type identification, which requires grouping cells with similar gene expression profiles using unsupervised clustering methods. However, the high dimensionality, inherent noise, and significant sparsity of scRNA-seq data present substantial obstacles to accurately determining relationships among cell samples. To address these challenges, we propose a novel deep subspace clustering approach for cell type identification that captures a more reliable subspace structure from scRNA-seq data. Our method leverages a robust self-representation learning framework to effectively characterize and learn the underlying cluster structure. This framework is optimized through an integrated strategy combining a structure-guided approach with an optimal transport algorithm, enhancing the robustness of the subspace clustering process. By mitigating the effects of noise and sparsity in scRNA-seq data, this approach enables more accurate cell clustering. Experimental results on 18 real scRNA-seq datasets demonstrate that our method outperforms several state-of-the-art clustering approaches tailored for scRNA-seq data, excelling in both accuracy and interpretability.

Citations: 0

DriverSub-SVM: a machine learning approach for cancer subtype classification by integrating patient-specific and global driver genes.

IF 3.3 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2025-12-29 DOI: 10.1186/s12859-025-06318-7

Junrong Song, Yuanli Gong, Zhiming Song, Xinggui Xu, Kun Qian, Yingbo Liu

Background: Cancer's complexity and heterogeneity pose significant challenges for personalized treatment. Accurate classification of patients into molecular subtypes is critical for targeted therapy and improved outcomes. However, existing methods often fail to simultaneously capture inter-patient heterogeneity and shared molecular patterns in driver gene profiles.

Results: To address this limitation, we propose DriverSub-SVM, a novel framework for interpretable cancer subtype classification that integrates patient-specific and cohort-wide driver gene information. Our method first models the bidirectional influence between mutated and dysregulated genes via a random walk on a functional interaction network. It then applies Bayesian Personalized Ranking (BPR) to infer personalized driver gene rankings for each patient. These rankings are aggregated into a consensus driver gene set using the Condorcet method. Subsequently, a One-Against-One Multiclass Support Vector Machine (OAO-MSVM) classifies patients based on their gene-level profiles. Evaluated on multiple TCGA datasets, DriverSub-SVM outperformed four state-of-the-art methods, achieving higher accuracy and identifying clinically relevant genes associated with prognosis and therapeutic response.

Conclusion: DriverSub-SVM offers an effective and interpretable approach for cancer subtype classification by bridging individual heterogeneity and population-level patterns. It enhances understanding of tumor biology and holds promise for precision oncology and biomarker discovery. The source code is available at https://github.com/sjunrong/DriverSub-SVM .

BMC Bioinformatics 26(1): 297. Pub Date : 2025-12-29 DOI: 10.1186/s12859-025-06318-7 PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12751776/pdf/

Citations: 0