
Briefings in Bioinformatics: latest articles

Advances and challenges in single-cell RNA sequencing data analysis: a comprehensive review.
IF 7.7 | Zone 2, Biology | Q1 Biochemical Research Methods | Pub Date: 2026-01-07 | DOI: 10.1093/bib/bbaf723
Ali Mohammad Nesari, Habib MotieGhader, Saeid Ghorbian

Single-cell RNA sequencing (scRNA-seq) has transformed the resolution of cellular heterogeneity, offering insights into dynamic biological processes from tumor evolution to immune regulation. However, its clinical translation is limited by challenges such as data sparsity, batch effects (differences caused by technical variation rather than biology), and the absence of standardized benchmarks for core pipelines like Seurat and Scanpy. This review outlines emerging computational strategies that address these limitations: (A) robust preprocessing, including SCTransform for zero-inflation (an excess of zero counts in gene-expression data) correction and Harmony for batch integration, achieving 30% faster alignment than BBKNN in cohorts exceeding 100,000 cells; (B) transformer-based annotation tools such as scGPT and CellTypist, which reach >95% accuracy in immune profiling using models pretrained on 33 million cells; and (C) multimodal integration with spatial transcriptomics (e.g., 10x Visium, cell2location v2), which delineates microenvironmental niches and rare CX3CR1+ T-cell subsets in disease contexts like glioblastoma and severe COVID-19. We further assess how scANVI bridges scRNA-seq and ATAC-seq to uncover epigenetic mechanisms underlying therapy resistance, and how spatial methods elucidate tumor-immune crosstalk at subcellular resolution. Despite these advances, ethical risks remain, particularly around re-identification of rare patient-derived clones such as pre-metastatic cells. To promote clinical adoption, we propose a roadmap that prioritizes benchmarked workflows (e.g., scverse ecosystem), privacy-aware data sharing via federated learning, and causal AI approaches to disentangle biological signal from technical artifact. By synthesizing computational innovations with translational case studies, this review equips researchers to navigate both the analytical and ethical complexities of scRNA-seq in pursuit of actionable diagnostics.
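To make the data-sparsity point concrete, the following is a minimal sketch in plain Python. It is not SCTransform or any Seurat/Scanpy routine: it only measures the fraction of zero entries in a toy cells-by-genes count matrix and applies a simple depth normalization followed by log1p, a common baseline step. The matrix values, the function names, and the scale factor of 10,000 are assumptions chosen for illustration.

```python
import math


def zero_fraction(counts):
    """Return the fraction of zero entries in a cells x genes count matrix."""
    total = sum(len(row) for row in counts)
    zeros = sum(1 for row in counts for c in row if c == 0)
    return zeros / total


def log_normalize(counts, scale=10_000):
    """Scale each cell to `scale` total counts, then apply log1p.

    This is a generic baseline normalization (an assumption for this sketch),
    not the variance-stabilizing model used by SCTransform.
    """
    normalized = []
    for row in counts:
        depth = sum(row)
        if depth == 0:
            # An empty cell stays all-zero instead of dividing by zero.
            normalized.append([0.0] * len(row))
            continue
        normalized.append([math.log1p(c / depth * scale) for c in row])
    return normalized


# Hypothetical toy data: 3 cells x 5 genes, mostly zeros.
toy_counts = [
    [0, 3, 0, 0, 1],
    [0, 0, 0, 2, 0],
    [5, 0, 0, 0, 0],
]

sparsity = zero_fraction(toy_counts)
normalized = log_normalize(toy_counts)
```

Zeros stay zero under this transform, so normalization alone does not reduce sparsity; that is one reason model-based approaches are used for zero-inflated data.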

Briefings in Bioinformatics, vol. 27, no. 1 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12860385/pdf/
Citations: 0
Systematic evaluation of computational tools to predict the effects of mutations on protein-ligand binding affinity in the absence of experimental structures.
IF 7.7 | Zone 2, Biology | Q1 Biochemical Research Methods | Pub Date: 2026-01-07 | DOI: 10.1093/bib/bbag035
Qisheng Pan, Stephanie Portelli, Thanh Binh Nguyen, David B Ascher

Drug resistance caused by mutations is a significant global health concern. One way to better understand this phenomenon is by studying changes in protein-ligand binding affinity upon mutation. While recent advances in protein modelling, such as AlphaFold2 and AlphaFold3, have transformed structural assessments, their utility in predicting mutation-induced binding affinity changes remains underexplored. We evaluated various mutation-based methods and scoring functions using computer-generated protein-ligand complexes. Compared to a baseline using experimental structures, we observed a performance drop ranging from 5% to 30% across different computational models. Specifically, using experimental receptors with docked ligands resulted in a ~5% drop, similar to that observed with AlphaFold3 models (~5%), despite the latter offering lower ligand root mean square deviation. However, using AlphaFold2 receptors with docking led to a greater performance loss (10%-20%), comparable to homology models with high sequence identity. Homology models based on low-identity templates showed over 30% decline. These performance differences were most pronounced for interface mutations and low molecular weight ligands. While AlphaFold models offer accurate protein and interaction predictions, they lack mutation-specific information, such as dynamic changes, highlighting the need for complementary mutation-aware methods for reliable analysis. Our findings provide insights into interpreting mutation effects on ligand binding using predicted structures and can guide more robust assessments of drug resistance mechanisms in silico.
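The ligand root mean square deviation mentioned above can be illustrated with a short sketch. It assumes both poses list the same ligand atoms in the same order and already sit in a common receptor frame, so no superposition is performed. The function name and the toy coordinates are hypothetical, and the abstract does not state how the authors computed the metric.

```python
import math


def ligand_rmsd(coords_a, coords_b):
    """RMSD (same units as the input, typically angstroms) between two poses.

    coords_a and coords_b are sequences of (x, y, z) tuples with matched atom
    order. No alignment is done: the poses are compared in place.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


# Hypothetical two-atom ligand: the predicted pose is shifted 1 unit along z.
reference_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted_pose = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
rmsd = ligand_rmsd(reference_pose, predicted_pose)
```

A low ligand RMSD says the pose is geometrically close to the reference. As the abstract notes, that alone did not translate into better prediction of mutation effects.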

Briefings in Bioinformatics, vol. 27, no. 1 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12874888/pdf/
Citations: 0
Divergent Eurasian ancestry and local adaptation shape the genetic landscapes of the Yugur and Uyghur.
IF 7.7 | Zone 2, Biology | Q1 Biochemical Research Methods | Pub Date: 2026-01-07 | DOI: 10.1093/bib/bbag041
Siyong Yu, Jia Wen, Yang Gao, Zhaoqing Yang, Xu Wang, Yan Lu, Jiayou Chu, Dilinuer Maimaitiyiming, Shuhua Xu

The Yugur and Uyghur people of northwestern China share documented Early Medieval origins, yet the evolutionary processes that shaped their present-day genomes remain unresolved. Here, we generate high-coverage whole-genome sequences for the Yugur and compare them with Uyghur genomes to reconstruct their demographic histories, ancestry profiles, and adaptive trajectories. Both groups derive from mixtures of East Eurasian ancestry (EEA) and West Eurasian ancestry (WEA) but in sharply contrasting proportions: the Yugur retain predominantly EEA (~90%), whereas the Uyghur harbor a near-equal balance. Modeling reveals distinct episodes of admixture in Gansu and Xinjiang, with identity-by-descent patterns indicating persistent but substantially reduced genetic continuity (FST = 0.021). Strikingly, despite their EEA-rich background, the Yugur show WEA-shifted allele frequencies at craniofacial loci, including EDAR and LIMS1, suggesting subtle trait convergence. Signals of recent positive selection further differentiate the two populations: the Yugur display strong selection on the FADS locus linked to lipid metabolism, whereas both groups exhibit selection at PPARA but with greater intensity in the Uyghur, consistent with their higher WEA. Functional enrichment analyses highlight overlapping immune and metabolic pathways, consistent with shared biological patterns shaped by demographic history and long-term residence in northwestern China. Together, these findings show how divergent admixture proportions and region-specific natural selection have produced distinct genomic architectures in two historically related populations along the Silk Road.
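For readers unfamiliar with the FST statistic quoted above, here is a minimal sketch of one textbook form, F_ST = (H_T - H_S) / H_T, computed from biallelic allele frequencies in two equally weighted populations and combined across loci as a ratio of sums. The abstract does not say which estimator the authors used, so this is an illustration only; the function name and the frequencies are hypothetical.

```python
def wright_fst(freqs_pop1, freqs_pop2):
    """Wright's F_ST for two equally weighted populations.

    Each argument is a list of reference-allele frequencies, one per locus.
    H_S is the mean within-population expected heterozygosity and H_T the
    expected heterozygosity of the pooled population; both are summed over
    loci before taking the ratio.
    """
    if len(freqs_pop1) != len(freqs_pop2) or not freqs_pop1:
        raise ValueError("need the same non-zero number of loci per population")
    h_s_sum = 0.0
    h_t_sum = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        h_s_sum += (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
        p_bar = (p1 + p2) / 2
        h_t_sum += 2 * p_bar * (1 - p_bar)
    if h_t_sum == 0:
        # Every locus is fixed for the same allele: no variation to partition.
        return 0.0
    return (h_t_sum - h_s_sum) / h_t_sum


# Hypothetical frequencies at one locus: 0.4 in one group, 0.6 in the other.
example_fst = wright_fst([0.4], [0.6])
```

With frequencies of 0.4 and 0.6 this gives H_S = 0.48 and H_T = 0.50, so F_ST = 0.04. Values near 0, such as the reported 0.021, indicate that most variation is shared between the two groups.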

Briefings in Bioinformatics, vol. 27, no. 1 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12885099/pdf/
Citations: 0
An effective fragment-based dual conditional diffusion framework for molecular generation.
IF 7.7 | Zone 2, Biology | Q1 Biochemical Research Methods | Pub Date: 2026-01-07 | DOI: 10.1093/bib/bbaf727
Haotian Chen, Yiting Shen, Jichun Li, Weizhong Zhao

Fragment-based molecular generation has emerged as a promising paradigm in structure-based drug design (SBDD), deriving effective compounds with advanced properties such as chemical validity, synthetic feasibility, and pharmacological relevance. However, existing approaches often struggle to generate molecules that both conform to 3D structural constraints and retain chemical plausibility. This is largely because prior work often treats scaffolds and R-groups of molecules indiscriminately, overlooking the distinct semantic roles they play. Specifically, the scaffold serves as the rigid structural backbone that determines the global geometric topology and binding pose, whereas R-groups act as functional substituents responsible for fine-tuning local physicochemical interactions. Therefore, in this work, we propose fragment-based dual conditional diffusion (FDC-Diff), a novel dual conditional diffusion framework that integrates chemical priors and structural cues for fragment-based molecular generation. Unlike traditional de novo methods that generate atoms sequentially, FDC-Diff decomposes the molecule generation process into two semantically complementary stages. Given the protein pocket and an initial fragment, in the first stage, a spatially constrained scaffold is constructed to capture the global molecular topology. In the second stage, R-groups are elaborated onto the obtained scaffold to capture local semantics and further refine molecular properties. To ensure synthetic accessibility, initial fragments and the scaffold-modification hierarchy are derived from curated reaction rules, and a physical-chemistry-inspired refinement step is applied to optimize final conformations. Experimental results on multiple SBDD benchmarks demonstrate that FDC-Diff achieves state-of-the-art performance in terms of comprehensive evaluations. Furthermore, our model excels at producing chemically valid, spatially compatible, and pharmacologically relevant molecules, suggesting its potential as a feasible tool for fragment-based drug design.

Briefings in Bioinformatics, vol. 27, no. 1 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12814976/pdf/
Citations: 0
Detection of alternative splicing: deep sequencing or deep learning?
IF 7.7 | Zone 2, Biology | Q1 Biochemical Research Methods | Pub Date: 2026-01-07 | DOI: 10.1093/bib/bbaf705
Lena Maria Hackl, Fabian Neuhaus, Sabine Ameling, Uwe Völker, Jan Baumbach, Olga Tsoy

Alternative splicing is a crucial mechanism of gene regulation that enables condition- and tissue-specific expression of gene isoforms. Its dysregulation plays a role in various diseases such as cancer, neurological disorders, and metabolic conditions. Despite its importance, accurate detection of alternative splicing events remains challenging. Comprehensive alternative splicing event detection typically requires deep sequencing with over 100 million reads; however, much of the publicly accessible RNA sequencing data is of lower sequencing depth. Recent advances, particularly deep learning models working with genomic sequences, offer new avenues for predicting alternative splicing without reliance on high sequencing depth data. Our study addresses the question: Can we utilize the vast repository of publicly available RNA sequencing data for comprehensive alternative splicing detection, despite the low sequencing depth? Our results demonstrate the potential of sequence-based deep learning tools such as AlphaGenome, SpliceAI and DeepSplice for initial hypothesis development and as additional filters in standard RNA sequencing pipelines, especially when sequencing depth is limited. Nonetheless, validation with higher sequencing depths remains essential for confirmation of splice events. Overall, our findings underscore the need for integrative methods combining genomic sequence data and RNA sequencing data for the prediction of tissue- and condition-specific alternative splicing in resource-limited settings.

Briefings in Bioinformatics, vol. 27, no. 1 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12790623/pdf/
Citations: 0
ANIA: an inception-attention network for predicting minimum inhibitory concentration of antimicrobial peptides.
IF 7.7 | Zone 2, Biology | Q1 Biochemical Research Methods | Pub Date: 2026-01-07 | DOI: 10.1093/bib/bbag023
Yen-Peng Chiu, Lantian Yao, Yun Tang, Chia-Ru Chung, Yuxuan Pang, Ying-Chih Chiang, Tzong-Yi Lee

Antimicrobial resistance poses a significant challenge to conventional antibiotics, underscoring the urgent need for alternative therapeutic strategies. Antimicrobial peptides (AMPs) have emerged as promising candidates due to their broad-spectrum antibacterial activity and distinct mechanisms of action. This study presents ANIA, a deep learning framework developed to predict the minimum inhibitory concentration (MIC) values of AMPs against three clinically significant bacteria: Staphylococcus aureus, Escherichia coli, and Pseudomonas aeruginosa. ANIA leverages Chaos Game Representation (CGR) to transform AMP sequences into frequency-based image features, which are subsequently processed through a hybrid architecture comprising stacked Inception modules, a Transformer encoder, and a regression head. This integrative architecture enables ANIA to capture both local motif-based features and global contextual patterns embedded within AMP sequences. In benchmarking experiments, ANIA outperformed existing tools, including ESKAPEE-Pred, AMPActiPred, and esAMPMIC, with higher correlation coefficients and lower predictive errors across all bacterial targets; the most pronounced improvement was observed for P. aeruginosa, a pathogen renowned for its multidrug resistance. Specifically, ANIA achieved Pearson correlation coefficients (PCCs) of 0.75-0.79 and mean squared errors (MSEs) of 0.23-0.26 across all species. Furthermore, motif-based interpretability analyses combining Grad-CAM visualizations, correlation heatmaps, motif frequency distributions, and hydrophobicity profiling revealed biologically meaningful subregions within the CGR matrix that are plausibly associated with antimicrobial efficacy. In conclusion, this study develops ANIA as a robust predictive tool for MIC estimation, offering valuable insights into the design of effective antimicrobial agents and contributing to the fight against antimicrobial resistance. A user-friendly web server for ANIA is available at https://biomics.lab.nycu.edu.tw/ANIA/.
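The two evaluation metrics reported above, PCC and MSE, can be sketched in plain Python as follows. The function names and the toy values are hypothetical, and the abstract does not state the scale on which MIC values were compared, so the numbers below carry no units.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_correlation(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need at least two paired values")
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    covariance = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_t * var_p)


# Hypothetical observed and predicted values for three peptides.
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
mse_value = mean_squared_error(observed, predicted)
pcc_value = pearson_correlation(observed, predicted)
```

The two metrics are complementary: PCC measures whether predictions rank and scale with the observations, while MSE penalizes absolute deviation, so a model can score well on one and poorly on the other.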

{"title":"ANIA: an inception-attention network for predicting minimum inhibitory concentration of antimicrobial peptides.","authors":"Yen-Peng Chiu, Lantian Yao, Yun Tang, Chia-Ru Chung, Yuxuan Pang, Ying-Chih Chiang, Tzong-Yi Lee","doi":"10.1093/bib/bbag023","DOIUrl":"10.1093/bib/bbag023","url":null,"abstract":"<p><p>Antimicrobial resistance poses a significant challenge to conventional antibiotics, underscoring the urgent need for alternative therapeutic strategies. Antimicrobial peptides (AMPs) have emerged as promising candidates due to their broad-spectrum antibacterial activity and distinct mechanisms of action. This study presents ANIA, a deep learning framework developed to predict the minimum inhibitory concentration (MIC) values of AMPs against three clinically significant bacteria: Staphylococcus aureus, Escherichia coli, and Pseudomonas aeruginosa. ANIA leverages Chaos Game Representation (CGR) to transform AMP sequences into frequency-based image features, which are subsequently processed through a hybrid architecture comprising stacked Inception modules, a Transformer encoder, and a regression head. This integrative architecture enables ANIA to capture both local motif-based features and global contextual patterns embedded within AMP sequences. In benchmarking experiments, ANIA achieved notably superior performance compared to existing tools, including ESKAPEE-Pred, AMPActiPred, and esAMPMIC, achieving higher correlation coefficients and lower predictive errors across all bacteria targets, with the most pronounced improvement observed for P. aeruginosa, a pathogen renowned for its multidrug resistance. Specifically, ANIA achieved PCCs of 0.75-0.79 and MSEs of 0.23-0.26 across all species. 
Furthermore, motif-based interpretability analyses combining Grad-CAM visualizations, correlation heatmaps, motif frequency distributions, and hydrophobicity profiling revealed biologically meaningful subregions within the CGR matrix that are plausibly associated with antimicrobial efficacy. In conclusion, this study develops ANIA as a robust predictive tool for MIC estimation, offering valuable insights into the design of effective antimicrobial agents and contributing to the fight against antimicrobial resistance. A user-friendly web server for ANIA is available at https://biomics.lab.nycu.edu.tw/ANIA/.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12895073/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146149120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
GATCL: graph attention network meets contrastive learning for spatial domain identification.
IF 7.7 Zone 2 Biology Q1 BIOCHEMICAL RESEARCH METHODS Pub Date: 2026-01-07 DOI: 10.1093/bib/bbag043
Jichong Mu, Yachen Yao, Qiuhao Chen, Jiqiu Sun, Tianyi Zhao

Spatial domain identification is an essential task for revealing spatial heterogeneity within tissues, providing insights into disease mechanisms, tissue development, and the cellular microenvironment. In recent years, spatial multi-omics has emerged as the new frontier in spatial domain identification that offers deeper insights into the complex interplay and functional dynamics of heterogeneous cell communities within their native tissue context. Most existing methods rely on static graph structures that treat all neighboring cells uniformly, failing to capture the nuanced cellular interactions within the microenvironment and thus blurring functional boundaries. Furthermore, cross-modal reconstruction performance is often degraded by overfitting to modality-specific noise, which may impair the precise delineation of spatial domains. Therefore, we present GATCL, a novel deep learning framework that integrates a graph attention network with contrastive learning (CL) for robust spatial domain identification. First, GATCL leverages the graph attention mechanism to dynamically assign weights to neighboring spots, adaptively modeling the complex cellular architecture. Second, it implements a cross-modal CL strategy that forces representations from the same spatial location to be similar while pushing those from different locations apart, thereby achieving robust alignment between modalities. Comprehensive experiments across six distinct datasets (spanning transcriptome, proteome, and chromatin) reveal that GATCL is superior to seven representative methods across six key evaluation metrics.
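
The cross-modal contrastive objective described in the abstract can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch under stated assumptions, not the GATCL implementation: the function names, the cosine similarity, the temperature value, and the toy embeddings are all hypothetical. Embeddings of the same spot in two modalities are treated as the positive pair, and every other spot acts as a negative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_modal_info_nce(z_a, z_b, temperature=0.5):
    """Mean InfoNCE loss; z_a[i] and z_b[i] embed the same spot in two modalities.

    Hypothetical helper for illustration only (not taken from GATCL).
    """
    n = len(z_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(z_a[i], z_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidate spots.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching spot.
        total += log_denom - logits[i]
    return total / n

# Toy embeddings (made up): three spots, two modalities.
rna = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
protein_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
protein_shuffled = [protein_aligned[1], protein_aligned[2], protein_aligned[0]]

loss_aligned = cross_modal_info_nce(rna, protein_aligned)
loss_shuffled = cross_modal_info_nce(rna, protein_shuffled)
```

When the two modalities agree spot by spot, the loss is lower than when the pairing is shuffled, which is the behavior such an objective rewards during training.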

Citations: 0
iDLDDG: predicting protein stability changes from missense mutations in DNA-binding proteins using integrated deep learning features.
IF 7.7 Zone 2 Biology Q1 BIOCHEMICAL RESEARCH METHODS Pub Date: 2026-01-07 DOI: 10.1093/bib/bbag050
Xuan Yu, Fang Ge, Dong-Jun Yu, Zhaohong Deng

To understand disease mechanisms and advance therapies, accurately predicting how missense mutations alter protein-DNA binding affinity is critical. Many existing models neglect the unique characteristics of missense mutations in both double-stranded DNA-binding proteins (DSBs) and single-stranded DNA-binding proteins (SSBs). To address this issue, we constructed a comprehensive dataset from diverse sources. By leveraging sequence-based embeddings from pretrained protein language models including ESM2, ProtTrans, and ESM1v, we developed iDLDDG, a deep learning framework that integrates multi-scale structural and evolutionary information via a multi-channel architecture. To balance residue-wise information density against entropy, our entropy-based algorithm determined 181 residues as optimal for modeling biophysical constraints. This approach enhances predictive accuracy and computational efficiency, thereby supporting large-scale assessments of mutation effects in DNA-binding proteins. iDLDDG achieves state-of-the-art performance, with a 10-fold cross-validation PCC of 0.755 on MPD276 and 0.632 on independent test sets encompassing both DSBs and SSBs, significantly surpassing existing methods. By establishing the first computational framework that rigorously differentiates DSB and SSB mutation mechanisms, our work provides a foundation for high-accuracy prediction of pathological mutations in DNA-binding proteins.

Citations: 0
ChemEmbed: a deep learning framework for metabolite identification using enhanced MS/MS data and multidimensional molecular embeddings.
IF 7.7 Zone 2 Biology Q1 BIOCHEMICAL RESEARCH METHODS Pub Date: 2026-01-07 DOI: 10.1093/bib/bbag054
Muhammad Faizan-Khan, Roger Giné, Josep M Badia, Maribel Pérez-Ribera, Manuel Ruiz-Botella, Alexandra Junza, Jordi Capellades, Iván Pérez-López, Shipei Xing, Abubaker Patan, Laura Brugnara, Anna Novials, Joan-Marc Servitja, Maria Vinaixa, Pieter C Dorrestein, Marta Sales-Pardo, Roger Guimerà, Oscar Yanes

Machine learning offers a promising path to annotating the large number of unidentified MS/MS spectra in metabolomics, addressing the limited coverage of current reference spectral libraries. However, existing methods often struggle with the high dimensionality and sparsity of MS/MS spectra and metabolite structures. ChemEmbed tackles these challenges by integrating multidimensional, continuous vector representations of chemical structures with enhanced MS/MS spectra. This enhancement is achieved by merging spectra across multiple collision energies and incorporating calculated neutral losses from 38 472 distinct compounds, providing richer input for a convolutional neural network (CNN). ChemEmbed ranks the correct candidate first in over 42% of cases and within the top five in more than 76% of cases. In external benchmarks such as CASMI 2016 and 2022, ChemEmbed outperforms SIRIUS 6, the current state-of-the-art in computational metabolomics. We applied ChemEmbed to predict structures in the Annotated Recurrent Unidentified Spectra (ARUS) dataset and confirmed 25 previously unidentified compounds. These findings demonstrate ChemEmbed's potential as a robust, scalable tool for accelerating metabolite identification in untargeted mass spectrometry workflows.
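
The spectrum enhancement mentioned in the abstract (merging spectra across collision energies and adding neutral losses) can be sketched roughly as follows. This is an illustrative sketch only: the function names, the 0.01 m/z merge tolerance, the keep-the-maximum-intensity rule, and the example peak values are assumptions, not details taken from ChemEmbed. A neutral loss is computed here as the precursor m/z minus the fragment m/z.

```python
def merge_spectra(spectra, tol=0.01):
    """Merge peak lists [(mz, intensity), ...] acquired at several collision energies.

    Peaks within `tol` m/z of the first peak of a group collapse into one peak
    that keeps the highest intensity (an assumed merging rule).
    """
    peaks = sorted(p for spectrum in spectra for p in spectrum)
    merged = []
    for mz, intensity in peaks:
        if merged and mz - merged[-1][0] <= tol:
            prev_mz, prev_intensity = merged[-1]
            merged[-1] = (prev_mz, max(prev_intensity, intensity))
        else:
            merged.append((mz, intensity))
    return merged

def neutral_losses(precursor_mz, peaks):
    """Return (precursor_mz - fragment_mz, intensity) for fragments below the precursor."""
    return [(round(precursor_mz - mz, 4), intensity)
            for mz, intensity in peaks if mz < precursor_mz]

# Made-up example: one precursor measured at a low and a high collision energy.
precursor = 180.0634
low_energy = [(180.0634, 100.0), (162.0528, 40.0)]
high_energy = [(162.0530, 80.0), (144.0423, 55.0)]

merged = merge_spectra([low_energy, high_energy])
losses = neutral_losses(precursor, merged)
```

The merged peak list and the neutral-loss list could then be binned or concatenated into a single input vector for a model; how ChemEmbed encodes them is not specified in the abstract.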

Citations: 0
A novel two-sample Mendelian randomization framework integrating common and rare variants: application to assess the effect of HDL-C on preeclampsia risk.
IF 7.7 Zone 2 Biology Q1 BIOCHEMICAL RESEARCH METHODS Pub Date: 2026-01-07 DOI: 10.1093/bib/bbaf649
Yu Zhang, Ming Li, David M Haas, C Noel Bairey Merz, Tsegaselassie Workalemahu, Kelli Ryckman, Janet M Catov, Lisa D Levine, Alexa Freedman, George R Saade, Jiaqi Hu, Hongyu Zhao, Xihao Li, Nianjun Liu, Qi Yan

Mendelian randomization (MR) has become an important technique for establishing causal relationships between risk factors and health outcomes. By using genetic variants as instrumental variables, it can mitigate bias due to confounding and reverse causation in observational studies. Current MR analyses have predominantly used common genetic variants as instruments, which represent only part of the genetic architecture of complex traits. Rare variants, which can have larger effect sizes and provide unique biological insights, have been understudied due to statistical and methodological challenges. We introduce MR-common and annotation-informed rare variants (MR-CARV), a novel framework integrating common and rare genetic variants in two-sample MR. This method leverages comprehensive genetic data made available by high-throughput sequencing technologies and large-scale consortia. Rare variants are aggregated into functional categories, such as gene-coding, gene-noncoding, and nongene regions, by leveraging variant annotations and biological impact as weights. The effects of rare variant sets are then estimated with STAARpipeline and combined with the estimated effects of common variants by the existing MR methods. Simulation studies demonstrate that MR-CARV maintains robust type I error and achieves higher statistical power, with up to a 66.3% relative increase compared with existing methods only based on common variants. Consistent with these findings, application to real data on high-density lipoprotein cholesterol (HDL-C) and preeclampsia showed that MR-CARV [inverse variance weighted (IVW)] yielded a more precise and statistically significant effect estimate (-0.020, SE = 0.0102, P = .0470) than IVW using only common variants (-0.023, SE = 0.0123, P = .0659).
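
For readers unfamiliar with the inverse variance weighted (IVW) estimator named in the abstract, the standard fixed-effect form for two-sample MR can be sketched as below. This is a minimal sketch, not MR-CARV and not STAARpipeline: the function name and the summary statistics are hypothetical, and the abstract does not specify how rare-variant set effects enter the combination. Each instrument contributes a per-variant exposure effect, an outcome effect, and the outcome standard error.

```python
import math

def ivw_estimate(beta_x, beta_y, se_y):
    """Fixed-effect IVW causal estimate from per-instrument summary statistics.

    beta_x: instrument-exposure effects
    beta_y: instrument-outcome effects
    se_y:   standard errors of the instrument-outcome effects
    Returns (estimate, standard error, two-sided normal p-value).
    """
    numerator = sum(bx * by / (se * se) for bx, by, se in zip(beta_x, beta_y, se_y))
    denominator = sum(bx * bx / (se * se) for bx, se in zip(beta_x, se_y))
    estimate = numerator / denominator
    std_error = math.sqrt(1.0 / denominator)
    z = estimate / std_error
    # Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return estimate, std_error, p_value

# Hypothetical summary statistics for three instruments (not from the paper).
beta_x = [0.10, 0.20, 0.15]
beta_y = [-0.002, -0.004, -0.003]
se_y = [0.002, 0.003, 0.0025]

estimate, std_error, p_value = ivw_estimate(beta_x, beta_y, se_y)
```

Adding more valid instruments increases the denominator and therefore shrinks the standard error, which matches the direction of the precision gain reported when rare-variant sets are included alongside common variants.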

Citations: 0