
Briefings in Bioinformatics: latest articles

GFSeeker: a splicing-graph-based approach for accurate gene fusion detection from long-read RNA sequencing data.
IF 7.7 | Zone 2 (Biology) | Q1 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2026-01-07 | DOI: 10.1093/bib/bbaf702
Bingyan Wang, Heng Hu, Runtian Gao, Guohua Wang, Tao Jiang

Gene fusions are critical oncogenic drivers and therapeutic targets in diverse cancers. Long-read RNA sequencing (RNA-seq) offers an unprecedented opportunity to resolve the full-length structure of fusion isoforms, but its high intrinsic error rate makes it difficult to identify true fusion events precisely. Here, we developed GFSeeker, a splicing-graph-based computational framework for accurate gene fusion detection from long-read RNA-seq data. GFSeeker combines a splicing-graph reference with a dual re-alignment validation step to overcome the noise introduced by high error rates. Benchmarking across simulated, non-tumor, and cancer cell line datasets demonstrated GFSeeker's state-of-the-art performance, with F1 scores 6%-15% higher than those of existing methods. Notably, GFSeeker identified the known fusion event MATN2-POP1 in the MCF-7 cancer cell line, which other tools missed, highlighting its sensitivity in resolving complex fusion events. These results support GFSeeker as a powerful and reliable tool for gene fusion discovery, with significant potential to advance cancer research and precision diagnostics.
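
For readers unfamiliar with the F1 metric cited in this abstract, the following minimal Python sketch shows how it could be computed for a set of fusion calls against a truth set. The function names and the placeholder gene pairs (GENE_A and so on) are hypothetical; only MATN2-POP1 is named in the abstract, and the numbers are toy values, not GFSeeker results. The abstract does not say whether the reported 6%-15% gain is absolute or relative.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate_fusion_calls(called: set, truth: set) -> float:
    """Score called fusions, given as (5' gene, 3' gene) tuples, against a truth set."""
    tp = len(called & truth)   # fusions reported and present in the truth set
    fp = len(called - truth)   # fusions reported but not in the truth set
    fn = len(truth - called)   # true fusions that were missed
    return f1_score(tp, fp, fn)


# Toy data: only MATN2-POP1 appears in the abstract; the other pairs are placeholders.
truth = {("MATN2", "POP1"), ("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")}
called = {("MATN2", "POP1"), ("GENE_A", "GENE_B"), ("GENE_X", "GENE_Y")}
print(round(evaluate_fusion_calls(called, truth), 3))
```

Treating each fusion as an ordered gene pair is a simplification; real benchmarks often also match breakpoints within a tolerance.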

Citations: 0
GRAFT: a graph-aware fusion transformer for cancer driver gene prediction.
IF 7.7 | Zone 2 (Biology) | Q1 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2026-01-07 | DOI: 10.1093/bib/bbaf706
Sang-Pil Cho, Young-Rae Cho

Identifying cancer driver genes is essential for precision oncology, but existing computational methods are often limited by their reliance on single biological networks and their inability to capture long-range molecular dependencies. To address these challenges, we propose GRAFT, a Graph-Aware Fusion Transformer. This framework uses a multi-view graph encoder to learn modality-specific features from protein-protein interactions, pathway co-occurrence, and gene semantic similarity. These representations are further enriched with two auxiliary feature types: structural encodings derived from network topology and functional embeddings guided by curated gene sets. The integrated features are then processed by a transformer backbone, in which a novel edge-attention bias makes the model explicitly sensitive to the underlying graph topologies, enabling effective modeling of both local and global dependencies. Extensive evaluations demonstrate that GRAFT performs competitively with leading state-of-the-art methods in pan-cancer analysis while consistently delivering superior predictive accuracy across numerous specific cancer types. More importantly, a functional enrichment analysis of the novel candidate driver genes predicted by our model confirms their strong associations with key cancer-related processes, demonstrating the model's ability to make biologically plausible discoveries. By delivering a powerful and interpretable framework, our model not only advances the identification of cancer driver genes but also establishes a robust paradigm for multimodal data integration in systems biology. The source code and datasets are publicly accessible at https://github.com/spcho-dev/GRAFT.
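
The idea of an attention bias that makes a transformer sensitive to graph topology can be illustrated with a generic sketch. The code below is not GRAFT's formulation, which the abstract does not specify; it assumes one common pattern, adding a scalar bias to the attention score of node pairs that share an edge. The function name `attention_with_edge_bias` and the `edge_bias` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_edge_bias(q, k, v, adj, edge_bias=1.0):
    """Scaled dot-product attention over nodes with an additive bias on graph edges.

    q, k, v: one row (list of floats) per node.
    adj: adjacency matrix; adj[i][j] is truthy when nodes i and j share an edge.
    edge_bias: assumed scalar added to the score of connected pairs.
    """
    n, d = len(q), len(q[0])
    weights, outputs = [], []
    for i in range(n):
        scores = []
        for j in range(n):
            dot = sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            scores.append(dot + (edge_bias if adj[i][j] else 0.0))
        w = softmax(scores)
        weights.append(w)
        # Each output row is the attention-weighted average of the value rows.
        outputs.append([sum(w[j] * v[j][c] for j in range(n)) for c in range(len(v[0]))])
    return outputs, weights
```

Because every node can still attend to every other node, global dependencies remain reachable, while the bias shifts weight toward graph neighbors.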

Citations: 0
A comprehensive survey of genome language models in bioinformatics.
IF 7.7 | Zone 2 (Biology) | Q1 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2026-01-07 | DOI: 10.1093/bib/bbaf724
Liyuan Shu, Jiao Tang, Xiaoyu Guan, Daoqiang Zhang

Large language models have revolutionized natural language processing by effectively modeling complex semantics and capturing long-range contextual relationships. Inspired by these advancements, genome language models (gLMs) have recently emerged, conceptualizing DNA and RNA sequences as biological texts and enabling the identification of intricate genomic grammar and distant regulatory interactions. This review examines the need for gLMs, emphasizing their capacity to overcome the limitations of traditional deep learning approaches in genomic sequence characterization. We comprehensively survey contemporary gLM architectures, including Transformer models, Hyena convolutions, and state space models, as well as various sequence tokenization strategies, assessing their applicability and effectiveness across diverse genomic applications. Additionally, we discuss foundational pretraining strategies and provide an overview of genomic pretraining datasets spanning multiple species and functional domains. We critically analyze evaluation methodologies, including supervised, zero-shot, and few-shot learning paradigms, as well as fine-tuning approaches. An extensive taxonomy of downstream tasks is presented, alongside a summary of existing benchmarks and emerging trends. Finally, we consider key challenges such as data scarcity, interpretability, and the computational demands of genomic modeling, and propose a roadmap to guide future advances in genome language modeling.
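
To make "sequence tokenization strategies" concrete, here is a small sketch of three schemes that are widely used for nucleotide sequences: single-character tokens, overlapping k-mers, and non-overlapping k-mers. The abstract does not list which strategies the review covers, so this selection, the function names, and the default `k=3` are assumptions for illustration.

```python
def char_tokens(seq: str) -> list:
    """One token per nucleotide."""
    return list(seq.upper())


def kmer_tokens(seq: str, k: int = 3, stride: int = 1) -> list:
    """Split a sequence into k-mers.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping k-mers.
    Trailing bases that do not fill a whole k-mer are dropped.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def encode(tokens: list, vocab: dict) -> list:
    """Map tokens to integer ids, growing the vocabulary on first sight."""
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]


vocab = {}
print(kmer_tokens("ACGTAC", k=3, stride=1))
print(kmer_tokens("ACGTAC", k=3, stride=3))
print(encode(kmer_tokens("ACGTAC", k=3, stride=1), vocab))
```

The trade-off is sequence length against vocabulary size: character tokens need only four or five symbols but produce long inputs, whereas k-mers shorten the input at the cost of up to 4^k vocabulary entries.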

Citations: 0
DL-GapFilling: a novel deep learning framework for improved plant genome gap filling.
IF 7.7 | Zone 2 (Biology) | Q1 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2026-01-07 | DOI: 10.1093/bib/bbag007
Yu Chen, Zihao Wang, Gang Wang, Guohua Wang

Genome assembly has been a cornerstone of bioinformatics for decades, with faster and more accurate assembly of unknown genomes remaining a critical challenge. However, genome diversity, structural variations, insufficient sequencing depth, and limitations of current algorithms often lead to numerous gaps during assembly, hindering the construction of high-quality reference genomes. While various assembly methods and software tools have been developed, most exhibit low efficiency in gap filling and fail to account for the intrinsic structural properties of genomic sequences. Here, we present DL-GapFilling, a deep learning-based framework for genome assembly and gap filling. DL-GapFilling leverages a novel Deep Filling Neural Network model to efficiently extract and contextualize flanking sequence information, and incorporates the BeamStar contraction-expand algorithm, which integrates a redefined cost function, an enhanced search strategy, and genomic structural priors to improve both generalization and efficiency in gap filling. In addition, a PredictionFilter mechanism is introduced to selectively retain high-confidence predictions, mitigating the impact of poorly predicted sequences on assembly quality. Experimental results demonstrate that DL-GapFilling significantly improves gap-filling performance across multiple plant or algal genome datasets, achieving increases of 15.6%, 6.1%, 16.7%, 5.5%, and 23.5% in the number of gaps filled compared to traditional tools, and outperforming existing DL-based methods in both efficiency and accuracy. These findings underscore the potential of DL-GapFilling as a powerful tool for advancing genome assembly research.
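
The abstract does not describe how the PredictionFilter scores a prediction, so the following is only a sketch of the general idea of retaining high-confidence gap fills. It assumes each predicted sequence comes with per-base probabilities and keeps a prediction when their geometric mean reaches a threshold. The function names, the scoring rule, and the `threshold=0.8` default are all hypothetical.

```python
import math


def mean_confidence(base_probs) -> float:
    """Geometric mean of per-base probabilities (assumed confidence score)."""
    if not base_probs or any(p <= 0.0 for p in base_probs):
        return 0.0
    return math.exp(sum(math.log(p) for p in base_probs) / len(base_probs))


def filter_predictions(predictions: dict, threshold: float = 0.8) -> dict:
    """Keep only gap fills whose confidence reaches the threshold.

    predictions maps a gap id to (predicted_sequence, per_base_probabilities).
    Gaps that fail the filter are left out, i.e. they stay unfilled.
    """
    return {
        gap_id: seq
        for gap_id, (seq, probs) in predictions.items()
        if mean_confidence(probs) >= threshold
    }


# Toy predictions, not output of DL-GapFilling.
predictions = {
    "gap1": ("ACGT", [0.90, 0.95, 0.90, 0.92]),
    "gap2": ("TTGA", [0.90, 0.30, 0.80, 0.60]),
}
print(filter_predictions(predictions))
```

A geometric mean penalizes a single very uncertain base more than an arithmetic mean would, which suits a filter meant to keep poorly predicted sequences out of the assembly.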

Citations: 0
GPCRact: a hierarchical framework for predicting ligand-induced GPCR activity via allosteric communication modeling.
IF 7.7 | Zone 2 (Biology) | Q1 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2026-01-07 | DOI: 10.1093/bib/bbaf719
Hyojin Son, Gwan-Su Yi

Accurate prediction of ligand-induced activity for G-protein-coupled receptors (GPCRs) is a cornerstone of drug discovery, yet it requires modeling allosteric communication: the long-range signaling that links ligand binding to distal conformational changes. Prevailing sequence-based models often fail to capture these three-dimensional dynamics, a limitation frequently masked by averaged performance on simpler Class A targets. To address this, we introduce GPCRact, a novel framework that models the biophysical principles of allosteric modulation in GPCR activation. It first constructs a high-resolution, three-dimensional structure-aware graph from the heavy-atom coordinates of functionally critical residues at binding and allosteric sites. A dual attention architecture then captures the activation process: cross-attention encodes the initial ligand-protein interaction at the binding site, whereas self-attention learns the subsequent intra-protein signal propagation. This hierarchical architecture is built upon an E(n)-Equivariant Graph Neural Network (EGNN) to explicitly model the conformational consequences of ligand binding, and is further refined with a tailored loss function and inference logic to mitigate error propagation. Underpinned by GPCRactDB, a comprehensive database we constructed for this study, GPCRact not only achieves state-of-the-art performance but also demonstrates robustly superior accuracy on a curated benchmark of allosterically complex receptors where existing models systematically underperform. Crucially, analysis of the learned attention weights confirms that the model identifies biologically validated allosteric pathways, a significant step toward resolving the black-box nature of previous methods. Thus, GPCRact provides a more accurate, interpretable, and mechanistically grounded solution to a long-standing challenge, paving the way for effective structure-guided drug discovery.
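
A structure-aware graph built from atomic coordinates can be sketched as a distance-cutoff contact graph. The abstract does not state GPCRact's edge rule, so the 4.5 Å cutoff and the function names below are assumptions. The sketch also shows why such graphs pair naturally with E(n)-equivariant networks: the edges depend only on inter-atomic distances, which rotations and translations leave unchanged.

```python
import math


def contact_edges(coords, cutoff=4.5):
    """Return index pairs (i, j), i < j, of atoms closer than the cutoff (in Å)."""
    edges = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges.add((i, j))
    return edges


def rotate_z_90_and_shift(coords, shift=(5.0, -2.0, 1.0)):
    """Rotate 90 degrees about the z axis, then translate: a rigid-body motion."""
    return [(-y + shift[0], x + shift[1], z + shift[2]) for x, y, z in coords]


# Toy coordinates, not taken from any receptor structure.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (10.0, 10.0, 10.0)]
print(sorted(contact_edges(coords)))
print(sorted(contact_edges(rotate_z_90_and_shift(coords))))
```

Both calls yield the same edge set, so a model that consumes this graph sees the same topology regardless of how the structure is oriented in space.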

Citations: 0
A systematic review of molecular representation learning foundation models.
IF 7.7 | Zone 2 (Biology) | Q1 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2026-01-07 | DOI: 10.1093/bib/bbaf703
Bosheng Song, Jiayi Zhang, Ying Liu, Yuansheng Liu, Jing Jiang, Sisi Yuan, Xia Zhen, Yiping Liu

Molecular representation learning (MRL) is a foundation for applying computational methods to drug discovery, enabling the transformation of molecular structures and properties into numerical vectors. These vectors serve as input for machine learning models and facilitate the prediction and analysis of molecular attributes, functions, and reactions. The advent of foundation models has introduced both new opportunities and challenges to MRL. These models have improved generalizability and transferability when data are scarce. Through pretraining and fine-tuning, foundation models can be adapted to various domains. Their robust encoding and generative abilities also allow the transformation of molecular data into more expressive forms. This paper provides a detailed review of current mainstream molecular descriptors and datasets, focusing primarily on the representation of small molecules while excluding larger molecules such as proteins and peptides. It classifies foundation models into two primary categories based on the form of input: unimodal-based and multimodal-based models. For each category, representative models are identified and their advantages and disadvantages evaluated. Moreover, we systematically summarize four core pretraining strategies for MRL foundation models, analyzing their task designs, applicable scenarios, and impacts on downstream performance. In addition, the application of molecular representation foundation models in drug discovery and development is discussed, together with the current status of model interpretability. The paper concludes with insights into the future directions of MRL foundation models.
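
As a toy illustration of turning a molecule into a numerical vector, the sketch below hashes character n-grams of a SMILES string into a fixed-length bit vector and compares two molecules by Tanimoto similarity. This is a deliberately crude stand-in: it ignores molecular graph structure, it is not one of the descriptors the review surveys, and the function names and parameters (`n=2`, `n_bits=64`) are hypothetical.

```python
import zlib


def hashed_ngram_fingerprint(smiles: str, n: int = 2, n_bits: int = 64) -> list:
    """Set one bit per character n-gram of the SMILES string.

    zlib.crc32 is used instead of hash() so the result is the same in every run.
    """
    bits = [0] * n_bits
    for i in range(len(smiles) - n + 1):
        gram = smiles[i:i + n]
        bits[zlib.crc32(gram.encode("utf-8")) % n_bits] = 1
    return bits


def tanimoto(a: list, b: list) -> float:
    """Shared set bits divided by bits set in either vector."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 1.0


ethanol = hashed_ngram_fingerprint("CCO")
ethylamine = hashed_ngram_fingerprint("CCN")
print(tanimoto(ethanol, ethylamine))
```

Learned representations replace this fixed hashing step with an encoder trained on large molecular corpora, but the downstream contract is the same: a fixed-length vector per molecule.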

Citations: 0
DeepRMSF: a deep learning-based automated approach for predicting atomic-level flexibility in RNA structure.
IF 7.7 | Zone 2 (Biology) | Q1 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2026-01-07 | DOI: 10.1093/bib/bbaf720
Chenjie Feng, Xiaowen Sun, Xintao Song, Lei Bao, Weikang Gong, Renmin Han

Understanding RNA conformational dynamics is essential to understanding its roles in complex biological processes. While computational methods have revolutionized the prediction of static 3D RNA structures, predicting local flexibility directly from structure remains a significant challenge. We developed DeepRMSF, a deep learning-based method that leverages atomic-level descriptions of RNA to predict vibrational flexibility given a tertiary structure. Trained on root-mean-square fluctuations (RMSF) derived from molecular dynamics (MD) simulations, DeepRMSF was benchmarked on 371 nonredundant RNAs, with 311 RNAs used for five-fold cross-validation (PCC = 0.7219-0.7464) and 60 RNAs as an independent test set (PCC = 0.734), ensuring minimal sequence/structural similarity between sets. DeepRMSF predicts the local flexibility of medium-sized RNAs (~75 nucleotides) in ~8.2 s, achieving a >3000-fold speed-up over MD simulations while maintaining strong extrapolative accuracy. Rather than replacing MD, DeepRMSF offers a scalable and practical alternative for transcriptome-scale screening of RNA flexibility, facilitating studies on RNA structure-dynamics-function relationships and supporting computational modeling in RNA biology.
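
The two quantities reported in this abstract, RMSF and the Pearson correlation coefficient (PCC), follow standard definitions that a short sketch can make concrete. The code below assumes the trajectory frames are already superimposed on a common reference; the function names and the toy trajectory are illustrative and unrelated to the DeepRMSF code or data.

```python
import math


def rmsf(trajectory):
    """Per-atom root-mean-square fluctuation about the mean position.

    trajectory[t][i] is the (x, y, z) position of atom i in frame t.
    Frames are assumed to be aligned to a common reference beforehand.
    """
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    result = []
    for i in range(n_atoms):
        mean = [sum(frame[i][c] for frame in trajectory) / n_frames for c in range(3)]
        msd = sum(
            sum((frame[i][c] - mean[c]) ** 2 for c in range(3)) for frame in trajectory
        ) / n_frames
        result.append(math.sqrt(msd))
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient; undefined if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy trajectory: atom 0 is static, atom 1 oscillates by 1 unit along x.
trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
]
print(rmsf(trajectory))
```

In an evaluation of this kind, `pearson` would be applied to the predicted and MD-derived RMSF profiles of the same RNA.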

Citations: 0
Turning heterogeneity of statistical epistasis networks to an advantage.
IF 7.7 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS Pub Date: 2026-01-07 DOI: 10.1093/bib/bbaf699
Diane Duroux, Federico Melograna, Héctor Climente-González, Bowen Fan, Andrew Walakira, Edoardo Efrem Gervasoni, Zuqi Li, Damian Roqueiro, Fabio Stella, Kristel Van Steen

Epistasis detection is hindered by multiple challenges, including the proliferation of analytic tools and the diverse methodological choices made in Genome-Wide Association Interaction Studies (GWAIS). These factors often produce inconsistent and only partially overlapping results, with individual methods emphasizing distinct aspects of epistasis. Although comparative evaluations of GWAIS approaches exist, they generally do not identify the factors responsible for methodological discrepancies or assess their implications for biomedical research. Consequently, it remains unclear which features of GWAIS strategies contribute most to these differences and which methods are most appropriate for revealing specific genetic architectures. Here, we present a workflow designed to systematically characterize heterogeneity in GWAIS results and derive practical recommendations. First, we assess non-replicability by comparing single nucleotide polymorphism (SNP) pair rankings and Statistical Epistasis Networks (SENs), graphs in which nodes represent genetic loci and edges denote epistatic interactions, to identify clusters of protocols with similar outcomes. SENs provide a structured framework for visualizing and comparing variation in epistasis detection, enabling prioritization of interactions recurrently identified across methods. Second, we propose strategies to reduce heterogeneity and enhance robustness, with particular emphasis on interpretability. Notably, we demonstrate that differences among SENs can be informative rather than disadvantageous, as they yield complementary perspectives on disease genetics. Finally, we highlight the benefits of informed SEN aggregation, showing how this approach can strengthen the utility of GWAIS for elucidating biological mechanisms relevant to disease prevention, diagnosis, and management.
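The idea of comparing SENs across protocols and prioritizing recurrently detected interactions can be sketched as follows. This is a minimal illustration, not the authors' workflow: the abstract does not say which similarity measure or aggregation rule is used, so Jaccard similarity on edge sets and a simple support count are assumed stand-ins, and the protocol names and SNP identifiers are invented.

```python
from collections import Counter
from itertools import combinations


def edge(a, b):
    """Undirected epistatic interaction between two loci."""
    return frozenset((a, b))


# Hypothetical SENs: each protocol maps to the set of SNP-pair edges it reports.
sens = {
    "protocol_A": {edge("rs1", "rs2"), edge("rs2", "rs3"), edge("rs4", "rs5")},
    "protocol_B": {edge("rs1", "rs2"), edge("rs2", "rs3")},
    "protocol_C": {edge("rs1", "rs2"), edge("rs6", "rs7")},
}


def jaccard(edges_1, edges_2):
    """Jaccard similarity of two edge sets (1.0 when both are empty)."""
    union = edges_1 | edges_2
    return len(edges_1 & edges_2) / len(union) if union else 1.0


def pairwise_similarity(networks):
    """Similarity for every pair of protocols, keyed by sorted name pairs."""
    return {
        (p, q): jaccard(networks[p], networks[q])
        for p, q in combinations(sorted(networks), 2)
    }


def recurrent_edges(networks, min_support):
    """Edges reported by at least `min_support` protocols, with their counts."""
    counts = Counter(e for edges in networks.values() for e in edges)
    return {e: c for e, c in counts.items() if c >= min_support}


similarities = pairwise_similarity(sens)
consensus = recurrent_edges(sens, min_support=2)
```

The pairwise similarities could feed a clustering step to group protocols with similar outcomes, while the edges unique to one protocol (dropped by `recurrent_edges`) are the ones the abstract argues may still carry complementary information.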

Citations: 0
Artificial intelligence in mitotic checkpoint modeling: transforming our understanding of cellular division through machine learning and predictive biology.
IF 7.7 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS Pub Date: 2026-01-07 DOI: 10.1093/bib/bbaf729
Bashar Ibrahim

Mitotic checkpoints safeguard genomic integrity by orchestrating the precise segregation of chromosomes during cell division. Yet their complex, nonlinear dynamics have long defied full understanding through traditional experimental and computational approaches. In recent years, artificial intelligence (AI) has begun to transform this landscape. Machine learning and deep learning methods now achieve substantial accuracy in predicting cellular behaviors and uncovering novel regulatory mechanisms within checkpoint networks. Advances include transformer architectures capable of predicting spindle assembly checkpoint engagement with >95% accuracy, graph neural networks that decode kinetochore-microtubule dynamics at subpixel resolution, and hybrid AI-mechanistic models that reveal previously hidden feedback circuits. By integrating multi-omics data and bridging molecular mechanisms with clinical applications, AI-driven approaches are opening significant opportunities for precision medicine in cancer and other proliferative diseases. This review synthesizes emerging computational frameworks, highlights transformative AI-driven discoveries, and proposes a roadmap for developing predictive, personalized models of mitotic checkpoint control, charting a path from computational insight to clinical impact.

Citations: 0
CNAttention: an attention-based deep multiple-instance method for uncovering copy number aberration signatures across cancers.
IF 7.7 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS Pub Date: 2026-01-07 DOI: 10.1093/bib/bbaf696
Ziying Yang, Michael Baudis

Somatic copy number aberrations (CNAs) represent a distinct class of genomic mutations associated with oncogenetic effects. Over the past three decades, significant volumes of CNA data have been generated through molecular-cytogenetic and genome sequencing-based techniques. These data have been pivotal in identifying cancer-related genes and advancing research on the relationship between CNAs and histopathologically defined cancer types. However, comprehensive studies of CNA landscapes and disease parameters are challenging due to the vast diagnostic and genomic heterogeneity encountered in "pan-cancer" approaches. In this study, we introduce CNAttention, an attention-based deep multiple-instance learning method designed to comprehensively analyze CNAs across different cancers and uncover specific CNA patterns within integrated gene-level CNA profiles of 30 cancer types. CNAttention effectively learns CNA features unique to each cancer type and generates CNA signatures for 30 cancer types using attention mechanisms, highlighting the distinctiveness of their CNA landscapes. CNAttention demonstrates high accuracy and exhibits stable performance even with the incorporation of external datasets or parameter adjustments, underscoring its effectiveness in tumor identification. Expanding these signatures to cancer classification trees reveals common patterns not only among physiologically related cancer types but also among clinico-pathologically distant types, such as different cancers originating from neural crest-derived cells. Detected signatures also uncover genomic heterogeneity within individual cancer types, for instance in brain lower-grade glioma. Additional experiments with classification models underscore the efficacy of these signatures in representing various cancer types and their potential utility in clinical diagnosis.
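For readers unfamiliar with attention-based multiple-instance learning, the sketch below shows the generic attention pooling step such methods commonly build on: each instance in a bag gets a score, the scores are normalized with a softmax, and the bag representation is the attention-weighted sum of instances. It is an assumption-laden illustration, not CNAttention's architecture, which the abstract does not describe; the function `attention_pool`, the parameters `V` and `w`, and the toy inputs are hypothetical.

```python
import math


def attention_pool(instances, V, w):
    """Attention-weighted pooling of a bag of instance feature vectors.

    instances: list of feature vectors h_k, each of length d
    V:         m x d projection matrix (list of rows)
    w:         scoring vector of length m
    Returns (attention weights, pooled bag vector).
    """
    # Score each instance: w . tanh(V h_k)
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * ui for wi, ui in zip(w, hidden)))

    # Numerically stable softmax over the instance scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Bag representation: attention-weighted sum of the instances.
    dim = len(instances[0])
    bag = [sum(a * h[j] for a, h in zip(weights, instances)) for j in range(dim)]
    return weights, bag


# Hypothetical bag of two instances with untrained, hand-picked parameters.
bag_instances = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, pooled = attention_pool(bag_instances, identity, [1.0, 0.0])
```

In a trained model the attention weights indicate which instances drive the bag-level prediction, which is the general mechanism by which attention can be read out as a signature.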

Citations: 0