Genome assembly has been a cornerstone of bioinformatics for decades, with faster and more accurate assembly of unknown genomes remaining a critical challenge. However, genome diversity, structural variations, insufficient sequencing depth, and limitations of current algorithms often lead to numerous gaps during assembly, hindering the construction of high-quality reference genomes. While various assembly methods and software tools have been developed, most exhibit low efficiency in gap filling and fail to account for the intrinsic structural properties of genomic sequences. Here, we present DL-GapFilling, a deep learning-based framework for genome assembly and gap filling. DL-GapFilling leverages a novel Deep Filling Neural Network model to efficiently extract and contextualize flanking sequence information, and incorporates the BeamStar contraction-expand algorithm, which integrates a redefined cost function, an enhanced search strategy, and genomic structural priors to improve both generalization and efficiency in gap filling. In addition, a PredictionFilter mechanism is introduced to selectively retain high-confidence predictions, mitigating the impact of poorly predicted sequences on assembly quality. Experimental results demonstrate that DL-GapFilling significantly improves gap-filling performance across multiple plant or algal genome datasets, achieving increases of 15.6%, 6.1%, 16.7%, 5.5%, and 23.5% in the number of gaps filled compared to traditional tools, and outperforming existing DL-based methods in both efficiency and accuracy. These findings underscore the potential of DL-GapFilling as a powerful tool for advancing genome assembly research.
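The idea of retaining only high-confidence predictions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PredictionFilter described by the authors: the function name `filter_predictions`, the use of mean per-base probability as the confidence score, and the 0.9 threshold are all hypothetical.

```python
def filter_predictions(predictions, min_confidence=0.9):
    """Keep predicted gap sequences whose mean per-base probability meets the threshold.

    `predictions` is a list of (sequence, per_base_probabilities) pairs.
    The scoring rule and threshold are assumptions made for illustration.
    """
    kept = []
    for sequence, base_probs in predictions:
        if not base_probs:
            continue  # no probabilities available, so the prediction is not retained
        mean_conf = sum(base_probs) / len(base_probs)
        if mean_conf >= min_confidence:
            kept.append((sequence, mean_conf))
    return kept


# Made-up candidates for illustration only.
candidates = [
    ("ACGT", [0.99, 0.97, 0.95, 0.98]),
    ("TTGA", [0.60, 0.55, 0.70, 0.65]),
]
retained = filter_predictions(candidates)
```

In this toy input only the first candidate passes the threshold, so a low-confidence fill is left as a gap rather than written into the assembly.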
{"title":"DL-GapFilling: a novel deep learning framework for improved plant genome gap filling.","authors":"Yu Chen, Zihao Wang, Gang Wang, Guohua Wang","doi":"10.1093/bib/bbag007","DOIUrl":"10.1093/bib/bbag007","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12834303/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146050282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large language models have revolutionized natural language processing by effectively modeling complex semantics and capturing long-range contextual relationships. Inspired by these advancements, genome language models (gLMs) have recently emerged, conceptualizing DNA and RNA sequences as biological texts and enabling the identification of intricate genomic grammar and distant regulatory interactions. This review examines the need for gLMs, emphasizing their capacity to overcome the limitations of traditional deep learning approaches in genomic sequence characterization. We comprehensively survey contemporary gLM architectures, including Transformer models, Hyena convolutions, and state space models, as well as various sequence tokenization strategies, assessing their applicability and effectiveness across diverse genomic applications. Additionally, we discuss foundational pretraining strategies and provide an overview of genomic pretraining datasets spanning multiple species and functional domains. We critically analyze evaluation methodologies, including supervised, zero-shot, and few-shot learning paradigms, as well as fine-tuning approaches. An extensive taxonomy of downstream tasks is presented, alongside a summary of existing benchmarks and emerging trends. Finally, we consider key challenges such as data scarcity, interpretability, and the computational demands of genomic modeling, and propose a roadmap to guide future advances in genome language modeling.
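One of the simplest sequence tokenization strategies of the kind the review surveys is k-mer tokenization. The sketch below is illustrative only: the function name `kmer_tokenize` and its parameters are hypothetical and are not taken from the review or from any particular gLM.

```python
def kmer_tokenize(seq: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers taken every `stride` bases."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = seq.upper()
    # The last window starts at len(seq) - k; trailing bases that do not
    # fill a complete k-mer are dropped.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


overlapping = kmer_tokenize("ACGTAC", k=3, stride=1)      # ACG, CGT, GTA, TAC
non_overlapping = kmer_tokenize("ACGTAC", k=3, stride=3)  # ACG, TAC
```

Overlapping k-mers keep every position in context at the cost of a longer token sequence, while non-overlapping k-mers shorten the input by a factor of k.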
{"title":"A comprehensive survey of genome language models in bioinformatics.","authors":"Liyuan Shu, Jiao Tang, Xiaoyu Guan, Daoqiang Zhang","doi":"10.1093/bib/bbaf724","DOIUrl":"10.1093/bib/bbaf724","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12805252/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145970602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Single-cell RNA sequencing is a powerful technology for investigating cell-to-cell heterogeneity, yet its application is often hindered by dropout events, making accurate imputation essential for downstream analyses. Existing imputation methods, however, frequently suffer from over-smoothing, which erases cell-to-cell heterogeneity in the imputed data and degrades downstream analyses. To overcome this limitation, we propose scGACL, a generative adversarial network (GAN) integrated with multi-scale contrastive learning. The GAN architecture drives the distribution of the imputed data toward that of the real data. To fundamentally address over-smoothing, the model incorporates a multi-scale contrastive learning mechanism: cell-level contrastive learning preserves fine-grained cell-to-cell heterogeneity, while cell-type-level contrastive learning maintains macroscopic biological variation across different cellular groups. These mechanisms function synergistically to ensure accurate imputation and effectively address the over-smoothing challenge. Comprehensive evaluations across diverse simulated and real-world datasets confirm that scGACL consistently outperforms existing methods in accurately recovering gene expression and improving downstream analyses such as cell clustering, differential gene expression analysis, and cell trajectory inference.
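For readers unfamiliar with contrastive learning, the sketch below shows a generic InfoNCE-style loss for a single anchor. It is an assumption-laden illustration, not the loss used by scGACL: the function names, the cosine similarity, and the temperature of 0.5 are all chosen here for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.5):
    """Generic InfoNCE loss: -log softmax score of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Toy 2-D embeddings, made up for illustration.
anchor = [1.0, 0.0]
close_loss = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
far_loss = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

The loss is small when the positive lies close to the anchor and the negatives lie far away, which is the pressure that keeps distinct cells (or cell types) from collapsing onto one another.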
{"title":"scGACL: a generative adversarial network with multi-scale contrastive learning for accurate single-cell RNA sequencing imputation.","authors":"Yanlin Jiang, Mengyuan Zhao, Jiahui Yan, Jijun Tang, Fei Guo","doi":"10.1093/bib/bbag018","DOIUrl":"10.1093/bib/bbag018","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12866930/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146112245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chaitanya K Jaladanki, Achal Ajeet Rayakar, Yap Xiu Huan, Hao Fan
Acetylcholinesterase (AChE) inhibition is a key mechanism in the treatment of neurodegenerative diseases and in counteracting toxic exposures to pesticides and nerve agents. However, accurately ranking the potency of covalently binding AChE inhibitors remains challenging due to the enzyme's structural flexibility and the chemical diversity of their covalent warheads. In this study, we developed an in silico protocol that integrates multi-structure covalent docking and machine-learning (ML) consensus scoring to improve docking-based potency ranking among covalent AChE inhibitors. We analyzed 65 ligand-bound (holo) human AChE crystal structures using hierarchical clustering to identify four representative conformations, along with one high-resolution apo structure, for multi-structure docking. A curated library of 412 organophosphate and carbamate inhibitors was then docked covalently and non-covalently into each receptor conformation. The resulting docking scores were evaluated against inhibitors' experimental logIC50 values using Spearman's rank correlation coefficient (rs). Covalent docking outperformed non-covalent docking (rs values up to 0.54 versus 0.18), and our ML consensus model trained on the five structures' covalent docking scores achieved the highest predictive accuracy (rs = 0.70), surpassing all single-structure and heuristic consensus baselines. Chemical cluster analysis revealed structure-activity trends based on ligand flexibility, polarity, and aromaticity. SHapley Additive exPlanations analysis highlighted the ML consensus model's ability to flexibly weight the influence that each structure's scores had on its predictions. It identified and exploited relationships in its training dataset that would be difficult to anticipate through a manual analysis of individual structures' docking performance metrics.
This framework is broadly applicable to other covalently targeted proteins, offering a generalizable and interpretable strategy for docking-based potency ranking.
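Spearman's rank correlation coefficient (rs), the metric used above to compare docking scores with experimental logIC50 values, is the Pearson correlation of the two rank vectors. The sketch below computes it in plain Python; the helper names are hypothetical and the numbers are made up for illustration, not data from the study.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman's rs: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up values for illustration only.
docking_scores = [-9.1, -7.4, -8.3, -6.0, -5.2]
log_ic50 = [-7.8, -6.5, -6.1, -5.9, -4.0]
rs = spearman_rs(docking_scores, log_ic50)
```

Because rs depends only on rank order, it rewards a scoring function for ordering inhibitors correctly even when the scores are not linearly related to logIC50.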
{"title":"Integrating multi-structure covalent docking with machine-learning consensus scoring enhances potency ranking of human acetylcholinesterase inhibitors.","authors":"Chaitanya K Jaladanki, Achal Ajeet Rayakar, Yap Xiu Huan, Hao Fan","doi":"10.1093/bib/bbag028","DOIUrl":"10.1093/bib/bbag028","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12866926/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146112252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spatially resolved transcriptomics (SRT) measures transcriptomes of cells within intact biological tissues, providing unprecedented opportunities to investigate tissue micro-environments, where spatial domains are modeled as clusters of spatially neighboring cells. Current methods for identifying spatial domains from SRT data rely mainly on the expression profiles and spatial coordinates of cells and ignore intercellular interactions, resulting in high sensitivity and low accuracy. To bridge these gaps, we introduce a novel framework, called SiDMGF (Signal-based Domain identification with Multi-Graph Fusion), that integrates gene set-derived signaling and spatial graphs to jointly model biological context, spatial information, and gene expression in cell embeddings, thereby dramatically improving the accuracy and robustness of spatial domain identification. Experimental results demonstrate that SiDMGF consistently outperforms state-of-the-art methods across multiple benchmark datasets and achieves superior domain identification performance on diverse spatial sequencing platforms. Furthermore, we demonstrate that SiDMGF can also be effectively applied to cancer-related tissue samples, accurately delineating micro-environment heterogeneity within tumor slices.
{"title":"Signal-based spatial domain identification of spatially resolved transcriptomics with multigraph fusion.","authors":"Yaxiong Ma, Yu Wang, Xiaoke Ma","doi":"10.1093/bib/bbag052","DOIUrl":"https://doi.org/10.1093/bib/bbag052","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146164232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In de novo drug design, deep learning-based approaches have become essential to efficiently navigate the vast chemical space of drug-like molecules. Recently, diffusion-based models have attracted significant attention in the generation of target-binding molecules. However, these models struggle to optimize binding affinity and drug-like properties simultaneously, and they incur high computational costs because of the long, sequential denoising process. To address these limitations, we propose the Global and local integrated gradient-based Diffusion Model (GlintDM). GlintDM introduces a significantly faster denoising process, namely skip transition, by leveraging global gradients and local gradients. Owing to this fast denoising process, GlintDM can perform the following three phases during molecule generation: position refinement, candidate evaluation, and ligand resampling. These phases allow GlintDM to identify optimal binding positions on the target protein and generate molecules satisfying multi-objective molecular properties. As a result, GlintDM outperforms other methods on both the CrossDocked and Binding MOAD datasets for Vina-related scores. Further validation through the PoseBusters test and assessment of molecular properties, such as steric clash and geometric properties, confirm that GlintDM can generate stable and high-quality molecules.
{"title":"Global and local integrated gradient-based diffusion model for de novo drug design.","authors":"Sejin Park, Minjae Chung, Hyunju Lee","doi":"10.1093/bib/bbag033","DOIUrl":"10.1093/bib/bbag033","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12874906/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146123869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predicting host-pathogen protein-protein interactions (PPIs) is a cornerstone of modern infectious disease research, offering unparalleled insights into the molecular mechanisms underlying infection and immune evasion. Despite its transformative potential, the field faces persistent challenges, including limited experimental data, class imbalance, and the dynamic evolution of pathogens. The current study explores cutting-edge computational approaches that have redefined host-pathogen protein-protein interaction (HP-PPI) prediction. Notably, transfer learning has emerged as a game changer, enabling models to leverage knowledge from well-characterized systems to predict interactions in previously underexplored pathogens. Hybrid and ensemble models have proven highly effective, combining the strengths of diverse algorithms to capture the complexity of biological interactions. Explainable AI tools are now bridging the gap between computational predictions and biological interpretability, offering actionable insights into key interaction drivers. Additionally, the review discusses advanced data integration techniques, such as multi-omics fusion and graph-based learning, which explore new dimensions in HP-PPI research. This synthesis of challenges, solutions, and future perspectives highlights a paradigm shift in computational biology, in which scalable, interpretable, and biologically informed models pave the way for breakthroughs in therapeutic discovery, vaccine development, and precision medicine. Our review sets the stage for future advancements, emphasizing the potential of next-generation technologies to unravel the intricate dance between hosts and pathogens.
{"title":"Comprehensive review and assessment of machine learning approaches for host-pathogen protein-protein interaction prediction.","authors":"Fatima Noor, Muhammad Tahir Ul Qamar","doi":"10.1093/bib/bbag051","DOIUrl":"10.1093/bib/bbag051","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12888821/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146156175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Corrections to the following abstracts.","authors":"","doi":"10.1093/bib/bbag080","DOIUrl":"10.1093/bib/bbag080","url":null,"abstract":"","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12888816/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146156209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accurate prediction of ligand-induced activity for G-protein-coupled receptors (GPCRs) is a cornerstone of drug discovery, yet it is challenged by the need to model allosteric communication, the long-range signaling that links ligand binding to distal conformational changes. Prevailing sequence-based models often fail to capture these three-dimensional dynamics, a limitation frequently masked by averaged performance on simpler Class A targets. To address this, we introduce GPCRact, a novel framework that models the biophysical principles of allosteric modulation in GPCR activation. It first constructs a high-resolution, three-dimensional structure-aware graph from the heavy-atom coordinates of functionally critical residues at binding and allosteric sites. A dual attention architecture then captures the activation process: cross-attention encodes the initial ligand-protein interaction at the binding site, whereas self-attention learns the subsequent intra-protein signal propagation. This hierarchical architecture is built upon an E(n)-Equivariant Graph Neural Network (EGNN) to explicitly model conformational consequences of ligand binding, and is further refined with a tailored loss function and inference logic to mitigate error propagation. Underpinned by GPCRactDB, a comprehensive database we constructed for this study, GPCRact not only achieves state-of-the-art performance but also demonstrates robustly superior accuracy on a curated benchmark of allosterically complex receptors where existing models systematically underperform. Crucially, analysis of the learned attention weights confirms that the model identifies biologically validated allosteric pathways, offering a significant step toward resolving the black-box nature of previous methods. Thus, GPCRact provides a more accurate, interpretable, and mechanistically grounded solution to a long-standing challenge, paving the way for effective structure-guided drug discovery.
{"title":"GPCRact: a hierarchical framework for predicting ligand-induced GPCR activity via allosteric communication modeling.","authors":"Hyojin Son, Gwan-Su Yi","doi":"10.1093/bib/bbaf719","DOIUrl":"10.1093/bib/bbaf719","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12805254/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145970617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bosheng Song, Jiayi Zhang, Ying Liu, Yuansheng Liu, Jing Jiang, Sisi Yuan, Xia Zhen, Yiping Liu
Molecular representation learning (MRL) is a foundation for leveraging computational methods in drug discovery, enabling the transformation of molecular structure and properties into numerical vectors. These vectors serve as input for machine learning models and facilitate the prediction and analysis of molecular attributes, functions, and reactions. The advent of foundation models has introduced both new opportunities and challenges to MRL. These models offer improved generalizability and transferability in data-scarce settings. Through pretraining and fine-tuning, foundation models can be adapted to various domains. Their robust encoding and generative abilities also allow the transformation of molecular data into more expressive forms. This paper provides a detailed review of current mainstream molecular descriptors and datasets, focusing primarily on the representation of small molecules while excluding larger molecules such as proteins and peptides. It classifies foundation models into two primary categories based on the form of input: unimodal-based and multimodal-based models. For each category, representative models are identified and their advantages and disadvantages evaluated. Moreover, we systematically summarize four core pretraining strategies for MRL foundation models, analyzing their task designs, applicable scenarios, and impacts on downstream performance. In addition, the application of molecular representation foundation models in drug discovery and development is discussed, together with the current status of model interpretability. The paper concludes with insights into the future directions of MRL foundation models.
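As a toy illustration of turning a molecular structure into a numerical vector, the sketch below counts heavy atoms in a SMILES string over a fixed vocabulary. It is a deliberately simple hand-crafted descriptor, not one of the foundation models reviewed in the paper. The vocabulary and the function name `atom_count_vector` are hypothetical, and the tokenizer assumes SMILES restricted to the organic subset without bracket atoms such as `[Na+]`.

```python
import re

# Assumed fixed vocabulary of heavy-atom symbols (organic subset only).
ATOM_VOCAB = ["B", "C", "N", "O", "P", "S", "F", "Cl", "Br", "I"]

# Two-letter halogens are listed first so "Cl" is not read as "C" plus a stray "l".
# Lowercase letters denote aromatic atoms in SMILES.
_ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def atom_count_vector(smiles):
    """Map a SMILES string to a vector of heavy-atom counts.

    Bonds, ring-closure digits, and branches are ignored; implicit
    hydrogens are not counted. Bracket atoms are NOT handled.
    """
    counts = dict.fromkeys(ATOM_VOCAB, 0)
    for token in _ATOM_PATTERN.findall(smiles):
        # Aromatic (lowercase) atoms are folded into their element symbol.
        symbol = token if token in counts else token.upper()
        counts[symbol] += 1
    return [counts[symbol] for symbol in ATOM_VOCAB]


if __name__ == "__main__":
    for name, smiles in [("ethanol", "CCO"), ("aspirin", "CC(=O)OC1=CC=CC=C1C(=O)O")]:
        print(name, atom_count_vector(smiles))
```

A count vector like this discards connectivity entirely, which is one reason learned representations are of interest: two molecules with the same formula but different structures map to the same vector here.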
{"title":"A systematic review of molecular representation learning foundation models.","authors":"Bosheng Song, Jiayi Zhang, Ying Liu, Yuansheng Liu, Jing Jiang, Sisi Yuan, Xia Zhen, Yiping Liu","doi":"10.1093/bib/bbaf703","DOIUrl":"10.1093/bib/bbaf703","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12784970/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145932191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}