Briefings in bioinformatics: latest articles (page 4)

DL-GapFilling: a novel deep learning framework for improved plant genome gap filling.

IF 7.7, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2026-01-07 DOI: 10.1093/bib/bbag007

Yu Chen, Zihao Wang, Gang Wang, Guohua Wang

Genome assembly has been a cornerstone of bioinformatics for decades, with faster and more accurate assembly of unknown genomes remaining a critical challenge. However, genome diversity, structural variations, insufficient sequencing depth, and limitations of current algorithms often lead to numerous gaps during assembly, hindering the construction of high-quality reference genomes. While various assembly methods and software tools have been developed, most exhibit low efficiency in gap filling and fail to account for the intrinsic structural properties of genomic sequences. Here, we present DL-GapFilling, a deep learning-based framework for genome assembly and gap filling. DL-GapFilling leverages a novel Deep Filling Neural Network model to efficiently extract and contextualize flanking sequence information, and incorporates the BeamStar contraction-expand algorithm, which integrates a redefined cost function, an enhanced search strategy, and genomic structural priors to improve both generalization and efficiency in gap filling. In addition, a PredictionFilter mechanism is introduced to selectively retain high-confidence predictions, mitigating the impact of poorly predicted sequences on assembly quality. Experimental results demonstrate that DL-GapFilling significantly improves gap-filling performance across multiple plant or algal genome datasets, achieving increases of 15.6%, 6.1%, 16.7%, 5.5%, and 23.5% in the number of gaps filled compared to traditional tools, and outperforming existing DL-based methods in both efficiency and accuracy. These findings underscore the potential of DL-GapFilling as a powerful tool for advancing genome assembly research.
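
The confidence-based filtering idea can be sketched as follows. This is a minimal illustration only: the abstract does not say how PredictionFilter scores a prediction, so the mean per-position maximum probability, the 0.9 threshold, and the names `mean_confidence` and `filter_fills` are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# Hypothetical input: for each gap, one probability vector over (A, C, G, T)
# per predicted position. The scoring rule below is an assumption, not the
# published PredictionFilter.
GapPrediction = Tuple[str, Sequence[Sequence[float]]]


def mean_confidence(base_probs: Sequence[Sequence[float]]) -> float:
    """Average, over positions, of the highest base probability."""
    if not base_probs:
        return 0.0
    return sum(max(p) for p in base_probs) / len(base_probs)


def filter_fills(candidates: Sequence[GapPrediction],
                 threshold: float = 0.9) -> List[str]:
    """Return the IDs of gaps whose predicted fill meets the threshold."""
    return [gap_id for gap_id, probs in candidates
            if mean_confidence(probs) >= threshold]


candidates = [
    ("gap1", [[0.97, 0.01, 0.01, 0.01], [0.02, 0.95, 0.02, 0.01]]),
    ("gap2", [[0.40, 0.30, 0.20, 0.10]]),
]
kept = filter_fills(candidates)  # only the confidently predicted gap is kept
```

Gaps that fail the threshold would simply stay unfilled in this sketch, which matches the stated goal of keeping poorly predicted sequences out of the assembly.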

Citations: 0

A comprehensive survey of genome language models in bioinformatics.

IF 7.7, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2026-01-07 DOI: 10.1093/bib/bbaf724

Liyuan Shu, Jiao Tang, Xiaoyu Guan, Daoqiang Zhang

Large language models have revolutionized natural language processing by effectively modeling complex semantics and capturing long-range contextual relationships. Inspired by these advancements, genome language models (gLMs) have recently emerged, conceptualizing DNA and RNA sequences as biological texts and enabling the identification of intricate genomic grammar and distant regulatory interactions. This review examines the need for gLMs, emphasizing their capacity to overcome the limitations of traditional deep learning approaches in genomic sequence characterization. We comprehensively survey contemporary gLM architectures, including Transformer models, Hyena convolutions, and state space models, as well as various sequence tokenization strategies, assessing their applicability and effectiveness across diverse genomic applications. Additionally, we discuss foundational pretraining strategies and provide an overview of genomic pretraining datasets spanning multiple species and functional domains. We critically analyze evaluation methodologies, including supervised, zero-shot, and few-shot learning paradigms, as well as fine-tuning approaches. An extensive taxonomy of downstream tasks is presented, alongside a summary of existing benchmarks and emerging trends. Finally, we contemplate key challenges such as data scarcity, interpretability, and the computational demands of genomic modeling, and propose a roadmap to guide future advances in genome language modeling.
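
As an illustration of what a sequence tokenization strategy looks like, the sketch below splits a DNA string into k-mers, either overlapping (stride 1) or non-overlapping (stride equal to k). The abstract does not list which strategies the review covers, so k-mer tokenization is used here only as a widely known example; the function name `kmer_tokenize` and its parameters are chosen for illustration.

```python
from typing import List


def kmer_tokenize(seq: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a nucleotide sequence into k-mers.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    Trailing bases that do not fill a whole k-mer are dropped.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


overlapping = kmer_tokenize("ACGTAC", k=3, stride=1)
non_overlapping = kmer_tokenize("ACGTAC", k=3, stride=3)
```

Overlapping k-mers keep every position represented in several tokens at the cost of a longer token sequence, while non-overlapping k-mers shorten the sequence by roughly a factor of k.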

Citations: 0

scGACL: a generative adversarial network with multi-scale contrastive learning for accurate single-cell RNA sequencing imputation.

IF 7.7, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2026-01-07 DOI: 10.1093/bib/bbag018

Yanlin Jiang, Mengyuan Zhao, Jiahui Yan, Jijun Tang, Fei Guo

Single-cell RNA sequencing is a powerful technology for investigating cell-to-cell heterogeneity, yet its application is often hindered by dropout events, making accurate imputation essential for downstream analyses. Existing imputation methods, however, frequently suffer from the over-smoothing problem, which results in the loss of cell-to-cell heterogeneity in the imputed outcomes and affects downstream analyses. To overcome this limitation, we propose scGACL, a generative adversarial network (GAN) integrated with multi-scale contrastive learning. The GAN architecture drives the distribution of the imputed data toward that of the real data. To fundamentally address over-smoothing, the model incorporates a multi-scale contrastive learning mechanism: cell-level contrastive learning preserves fine-grained cell-to-cell heterogeneity, while cell-type-level contrastive learning maintains macroscopic biological variation across different cellular groups. These mechanisms function synergistically to ensure accurate imputation and effectively address the over-smoothing challenge. Comprehensive evaluations across diverse simulated and real-world datasets confirm that scGACL consistently outperforms existing methods in accurately recovering gene expression and improving downstream analyses such as cell clustering, differential gene expression analysis, and cell trajectory inference.

Citations: 0

Integrating multi-structure covalent docking with machine-learning consensus scoring enhances potency ranking of human acetylcholinesterase inhibitors.

IF 7.7, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2026-01-07 DOI: 10.1093/bib/bbag028

Chaitanya K Jaladanki, Achal Ajeet Rayakar, Yap Xiu Huan, Hao Fan

Acetylcholinesterase (AChE) inhibition is a key mechanism in the treatment of neurodegenerative diseases and in counteracting toxic exposures to pesticides and nerve agents. However, accurately ranking the potency of covalently binding AChE inhibitors remains challenging due to the enzyme's structural flexibility and the chemical diversity of the inhibitors' covalent warheads. In this study, we developed an in silico protocol that integrates multi-structure covalent docking and machine-learning (ML) consensus scoring to improve docking-based potency ranking among covalent AChE inhibitors. We analyzed 65 ligand-bound (holo) human AChE crystal structures using hierarchical clustering to identify four representative conformations, along with one high-resolution apo structure, for multi-structure docking. A curated library of 412 organophosphate and carbamate inhibitors was then docked covalently and non-covalently into each receptor conformation. The resulting docking scores were evaluated against inhibitors' experimental logIC50 values using Spearman's rank correlation coefficient (rs). Covalent docking outperformed non-covalent docking (rs values up to 0.54 versus 0.18), and our ML consensus model trained on the five structures' covalent docking scores achieved the highest predictive accuracy (rs = 0.70), surpassing all single-structure and heuristic consensus baselines. Chemical cluster analysis revealed structure-activity trends based on ligand flexibility, polarity, and aromaticity. SHapley Additive exPlanations analysis highlighted the ML consensus model's ability to flexibly distribute the influence each structure's scores had on its predictions. It identified and exploited relationships based on its training dataset that would be difficult to anticipate through a manual analysis of individual structures' docking performance metrics. This framework is broadly applicable to other covalently targeted proteins, offering a generalizable and interpretable strategy for docking-based potency ranking.
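
For readers unfamiliar with the evaluation metric, Spearman's rank correlation is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below implements it with the standard library only. The docking scores and logIC50 values are made-up illustrative numbers, not data from the study, and the helper names are chosen for the example.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative values only (hypothetical docking scores and logIC50).
docking_scores = [-9.1, -7.4, -8.2, -6.0, -7.9]
log_ic50 = [-7.5, -5.9, -6.1, -4.8, -6.6]
rs = spearman(docking_scores, log_ic50)
```

Because only the ordering of the values matters, the metric suits potency ranking, where a docking score needs to sort compounds correctly rather than predict logIC50 on an absolute scale.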

Citations: 0

Signal-based spatial domain identification of spatially resolved transcriptomics with multigraph fusion.

IF 7.7, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2026-01-07 DOI: 10.1093/bib/bbag052

Yaxiong Ma, Yu Wang, Xiaoke Ma

Spatially resolved transcriptomics (SRT) measures transcriptomes of cells within intact biological tissues, providing unprecedented opportunities to investigate tissue micro-environments, where spatial domains are modeled as clusters of spatially neighboring cells. Current methods for identifying spatial domains from SRT data rely mainly on the expression profiles and spatial coordinates of cells and ignore intercellular interactions, resulting in high sensitivity and low accuracy. To bridge these gaps, we introduce a novel framework, called SiDMGF (Signal-based Domain identification with Multi-Graph Fusion), that integrates gene set-derived signaling and spatial graphs to jointly model biological context, spatial information, and gene expression in the cell embedding, thereby dramatically improving the accuracy and robustness of spatial domain identification. Experimental results demonstrate that SiDMGF consistently outperforms state-of-the-art methods across multiple benchmark datasets and achieves superior domain identification performance on diverse spatial sequencing platforms. Furthermore, we demonstrate that the proposed SiDMGF can also be effectively applied to cancer-related tissue samples, accurately delineating micro-environment heterogeneity within tumor slices.
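
A spatial graph over cells can be built in several ways. The sketch below uses a k-nearest-neighbor rule on 2D coordinates, which is a common choice but an assumption here: the abstract does not state how SiDMGF constructs its spatial graph, and the function name `knn_spatial_graph` and the toy coordinates are purely illustrative.

```python
from math import dist
from typing import Sequence, Set, Tuple

Edge = Tuple[int, int]


def knn_spatial_graph(coords: Sequence[Tuple[float, float]],
                      k: int = 1) -> Set[Edge]:
    """Undirected edge set linking each cell to its k nearest neighbors.

    Edges are stored as (smaller_index, larger_index), so an edge found
    from both endpoints appears once. Brute force, O(n^2 log n).
    """
    edges: Set[Edge] = set()
    for i, p in enumerate(coords):
        others = [j for j in range(len(coords)) if j != i]
        nearest = sorted(others, key=lambda j: dist(p, coords[j]))[:k]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))
    return edges


# Toy coordinates: three nearby cells and one distant cell.
cells = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
graph = knn_spatial_graph(cells, k=1)
```

For real tissue sections with many thousands of spots, a spatial index such as a k-d tree would replace the brute-force neighbor search.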

Citations: 0

Global and local integrated gradient-based diffusion model for de novo drug design.

IF 7.7, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2026-01-07 DOI: 10.1093/bib/bbag033

Sejin Park, Minjae Chung, Hyunju Lee

In de novo drug design, deep learning-based approaches have become essential to efficiently navigate the vast chemical space of drug-like molecules. Recently, diffusion-based models have attracted significant attention in the generation of target-binding molecules. However, these models have difficulty in simultaneously optimizing the binding affinity and drug-like properties and require high computational costs because of the long and sequential denoising process. To address these limitations, we propose the Global and local integrated gradient-based Diffusion Model (GlintDM). GlintDM introduces a significantly faster denoising process, namely skip transition, by leveraging global gradients and local gradients. Due to the fast denoising process, GlintDM can perform the following three phases during molecule generation: position refinement, candidate evaluation, and ligand resampling. These phases allow GlintDM to identify optimal binding positions to the target protein and generate molecules satisfying multi-objective molecular properties. As a result, GlintDM outperforms other methods on both the CrossDocked and Binding MOAD datasets for Vina-related scores. Further validation through the PoseBusters test and assessment of molecular properties, such as steric clash and geometric properties, confirms that GlintDM can generate stable and high-quality molecules.

Citations: 0

Comprehensive review and assessment of machine learning approaches for host-pathogen protein-protein interaction prediction.

IF 7.7, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2026-01-07 DOI: 10.1093/bib/bbag051

Fatima Noor, Muhammad Tahir Ul Qamar

Predicting host-pathogen protein-protein interactions (PPIs) is a cornerstone of modern infectious disease research, offering unparalleled insights into the molecular mechanisms underlying infection and immune evasion. Despite its transformative potential, the field faces persistent challenges, including limited experimental data, class imbalance, and the dynamic evolution of pathogens. The current study explores cutting-edge computational approaches that have redefined host-pathogen protein-protein interaction (HP-PPI) prediction. Notably, transfer learning has emerged as a game changer, enabling models to leverage knowledge from well-characterized systems to predict interactions in previously underexplored pathogens. Hybrid and ensemble models have proven highly effective, combining the strengths of diverse algorithms to capture the complexity of biological interactions. Explainable AI tools are now bridging the gap between computational predictions and biological interpretability, offering actionable insights into key interaction drivers. Additionally, the review discusses advanced data integration techniques, such as multi-omics fusion and graph-based learning, which explore new dimensions in HP-PPI research. This synthesis of challenges, solutions, and future perspectives highlights a paradigm shift in computational biology, in which scalable, interpretable, and biologically informed models pave the way for breakthroughs in therapeutic discovery, vaccine development, and precision medicine. Our review sets the stage for future advancements, emphasizing the potential of next-generation technologies to unravel the intricate dance between hosts and pathogens.

Citations: 0

Corrections to the following abstracts.

IF 7.7 Zone 2 Biology Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2026-01-07 DOI: 10.1093/bib/bbag080

Citations: 0

GPCRact: a hierarchical framework for predicting ligand-induced GPCR activity via allosteric communication modeling.

IF 7.7 Zone 2 Biology Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2026-01-07 DOI: 10.1093/bib/bbaf719

Hyojin Son, Gwan-Su Yi

Accurate prediction of ligand-induced activity for G-protein-coupled receptors (GPCRs) is a cornerstone of drug discovery, yet it is challenged by the need to model allosteric communication, the long-range signaling that links ligand binding to distal conformational changes. Prevailing sequence-based models often fail to capture these three-dimensional dynamics, a limitation frequently masked by averaged performance on simpler Class A targets. To address this, we introduce GPCRact, a novel framework that models the biophysical principles of allosteric modulation in GPCR activation. It first constructs a high-resolution, three-dimensional structure-aware graph from the heavy-atom coordinates of functionally critical residues at binding and allosteric sites. A dual-attention architecture then captures the activation process: cross-attention encodes the initial ligand-protein interaction at the binding site, whereas self-attention learns the subsequent intra-protein signal propagation. This hierarchical architecture is built upon an E(n)-Equivariant Graph Neural Network (EGNN) to explicitly model the conformational consequences of ligand binding, and is further refined with a tailored loss function and inference logic to mitigate error propagation. Underpinned by GPCRactDB, a comprehensive database we constructed for this study, GPCRact not only achieves state-of-the-art performance but also demonstrates robustly superior accuracy on a curated benchmark of allosterically complex receptors where existing models systematically underperform. Crucially, analysis of the learned attention weights confirms that the model identifies biologically validated allosteric pathways, offering a significant step toward resolving the black-box nature of previous methods. Thus, GPCRact provides a more accurate, interpretable, and mechanistically grounded solution to a long-standing challenge, paving the way for effective structure-guided drug discovery.
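
The abstract names an E(n)-Equivariant Graph Neural Network as the backbone but gives no implementation details. As a rough illustration of what equivariance means in this setting, the following is a minimal pure-Python sketch of one EGNN-style update on a fully connected toy graph. The scalar node features, the functions `edge_fn`, `coord_fn`, and `node_fn`, and the helper `rotate_z` are hypothetical stand-ins for learned networks; none of them is taken from GPCRact.

```python
import math

# Hypothetical stand-ins for learned networks (assumptions, not GPCRact code).
def edge_fn(h_i, h_j, d2):
    """Message built only from invariant inputs: features and squared distance."""
    return math.tanh(h_i * h_j - 0.1 * d2)

def coord_fn(m_ij):
    """Scalar weight applied to the relative position vector."""
    return 0.5 * m_ij

def node_fn(h_i, agg):
    """Feature update from the aggregated messages."""
    return h_i + math.tanh(agg)

def egnn_layer(h, x, edge_fn=edge_fn, coord_fn=coord_fn, node_fn=node_fn):
    """One EGNN-style step: h holds scalar features, x holds 3D coordinates."""
    n = len(h)
    new_h, new_x = [], []
    for i in range(n):
        agg = 0.0
        shift = [0.0, 0.0, 0.0]
        for j in range(n):
            if i == j:
                continue
            diff = [a - b for a, b in zip(x[i], x[j])]
            d2 = sum(c * c for c in diff)      # invariant to rotation and translation
            m_ij = edge_fn(h[i], h[j], d2)
            w = coord_fn(m_ij)
            shift = [s + w * c for s, c in zip(shift, diff)]
            agg += m_ij
        scale = 1.0 / max(n - 1, 1)            # average over neighbours
        new_x.append([a + scale * s for a, s in zip(x[i], shift)])
        new_h.append(node_fn(h[i], agg))
    return new_h, new_x

def rotate_z(p, theta):
    """Rotate a 3D point about the z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * p[0] - s * p[1], s * p[0] + c * p[1], p[2]]
```

Because messages depend only on features and squared distances, and coordinates move only along relative position vectors, rotating or translating the input coordinates changes the output coordinates by the same transform and leaves the node features unchanged.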


Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12805254/pdf/

Citations: 0

A systematic review of molecular representation learning foundation models.

IF 7.7 Zone 2 Biology Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2026-01-07 DOI: 10.1093/bib/bbaf703

Bosheng Song, Jiayi Zhang, Ying Liu, Yuansheng Liu, Jing Jiang, Sisi Yuan, Xia Zhen, Yiping Liu

Molecular representation learning (MRL) is a foundation for applying computational methods to drug discovery, enabling the transformation of molecular structures and properties into numerical vectors. These vectors serve as input for machine learning models and facilitate the prediction and analysis of molecular attributes, functions, and reactions. The advent of foundation models has introduced both new opportunities and challenges to MRL. These models offer improved generalizability and transferability when data are scarce. Through pretraining and fine-tuning, foundation models can be adapted to various domains. Their robust encoding and generative abilities also allow the transformation of molecular data into more expressive forms. This paper provides a detailed review of current mainstream molecular descriptors and datasets, focusing primarily on the representation of small molecules while excluding larger molecules such as proteins and peptides. It classifies foundation models into two primary categories based on the form of input: unimodal and multimodal models. For each category, representative models are identified and their advantages and disadvantages evaluated. Moreover, we systematically summarize four core pretraining strategies for MRL foundation models, analyzing their task designs, applicable scenarios, and impacts on downstream performance. In addition, the application of molecular representation foundation models in drug discovery and development is discussed, together with the current status of model interpretability. The paper concludes with insights into the future directions of MRL foundation models.
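
To make the idea of turning a molecular structure into a numerical vector concrete, here is a deliberately simple sketch: a fixed-length atom-count vector computed from a SMILES string. This is a hand-crafted toy descriptor, not a foundation model and not a method from the review. The vocabulary `VOCAB`, the pattern `TOKEN_RE`, and the function `atom_count_vector` are assumptions made for illustration.

```python
import re

# Assumed toy vocabulary: a few aliphatic (upper-case) and aromatic
# (lower-case) atom symbols. Real descriptors cover far more atom types.
VOCAB = ["C", "N", "O", "S", "Cl", "Br", "c", "n", "o", "s"]

# Two-letter symbols come first so "Cl" is not read as "C" followed by "l".
TOKEN_RE = re.compile(r"Cl|Br|[CNOS]|[cnos]")

def atom_count_vector(smiles):
    """Map a SMILES string to a fixed-length vector of atom counts.

    Limitations of this sketch: bonds, rings, charges, and stereochemistry
    are ignored, and bracket atoms such as [Na+] are mis-tokenized.
    """
    counts = {symbol: 0 for symbol in VOCAB}
    for token in TOKEN_RE.findall(smiles):
        counts[token] += 1
    return [counts[symbol] for symbol in VOCAB]

ethanol = atom_count_vector("CCO")
chlorobenzene = atom_count_vector("Clc1ccccc1")
```

Every molecule maps to a vector of the same length, which is what lets such vectors serve as input to a standard machine learning model. Learned representations replace these hand-picked counts with features derived from data.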


Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12784970/pdf/

Citations: 0