Ali Mohammad Nesari, Habib MotieGhader, Saeid Ghorbian
Single-cell RNA sequencing (scRNA-seq) has transformed the resolution of cellular heterogeneity, offering insights into dynamic biological processes from tumor evolution to immune regulation. However, its clinical translation is limited by challenges such as data sparsity, batch effects (differences caused by technical variation rather than biology), and the absence of standardized benchmarks for core pipelines like Seurat and Scanpy. This review outlines emerging computational strategies that address these limitations: (A) robust preprocessing, including SCTransform for correcting zero inflation (an excess of zero counts in gene-expression data) and Harmony for batch integration, which achieves 30% faster alignment than BBKNN in cohorts exceeding 100,000 cells; (B) transformer-based annotation tools such as scGPT and CellTypist, which reach >95% accuracy in immune profiling using models pretrained on 33 million cells; and (C) multimodal integration with spatial transcriptomics (e.g., 10x Visium, cell2location v2), which delineates microenvironmental niches and rare CX3CR1+ T-cell subsets in disease contexts like glioblastoma and severe COVID-19. We further assess how scANVI bridges scRNA-seq and ATAC-seq to uncover epigenetic mechanisms underlying therapy resistance, and how spatial methods elucidate tumor-immune crosstalk at subcellular resolution. Despite these advances, ethical risks remain, particularly around re-identification of rare patient-derived clones such as pre-metastatic cells. To promote clinical adoption, we propose a roadmap that prioritizes benchmarked workflows (e.g., the scverse ecosystem), privacy-aware data sharing via federated learning, and causal AI approaches to disentangle biological signal from technical artifact. By synthesizing computational innovations with translational case studies, this review equips researchers to navigate both the analytical and ethical complexities of scRNA-seq in pursuit of actionable diagnostics.
{"title":"Advances and challenges in single-cell RNA sequencing data analysis: a comprehensive review.","authors":"Ali Mohammad Nesari, Habib MotieGhader, Saeid Ghorbian","doi":"10.1093/bib/bbaf723","journal":{"name":"Briefings in bioinformatics","volume":"27 1"},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12860385/pdf/"}
Qisheng Pan, Stephanie Portelli, Thanh Binh Nguyen, David B Ascher
Drug resistance caused by mutations is a significant global health concern. One way to better understand this phenomenon is by studying changes in protein-ligand binding affinity upon mutation. While recent advances in protein modelling, such as AlphaFold2 and AlphaFold3, have transformed structural assessments, their utility in predicting mutation-induced binding affinity changes remains underexplored. We evaluated various mutation-based methods and scoring functions using computer-generated protein-ligand complexes. Compared to a baseline using experimental structures, we observed a performance drop ranging from 5% to 30% across different computational models. Specifically, using experimental receptors with docked ligands resulted in a ~5% drop, similar to that observed with AlphaFold3 models (~5%), despite the latter offering lower ligand root mean square deviation. However, using AlphaFold2 receptors with docking led to a greater performance loss (10%-20%), comparable to homology models with high sequence identity. Homology models based on low-identity templates showed over 30% decline. These performance differences were most pronounced for interface mutations and low molecular weight ligands. While AlphaFold models offer accurate protein and interaction predictions, they lack mutation-specific information, such as dynamic changes, highlighting the need for complementary mutation-aware methods for reliable analysis. Our findings provide insights into interpreting mutation effects on ligand binding using predicted structures and can guide more robust assessments of drug resistance mechanisms in silico.
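The ligand root mean square deviation mentioned above can be illustrated with a minimal sketch. It assumes the predicted and reference ligands list the same atoms in the same order and are already in a common coordinate frame; the function name `ligand_rmsd` is hypothetical, and the abstract does not say whether the authors applied superposition or symmetry correction, so none is done here.

```python
import math


def ligand_rmsd(coords_pred, coords_ref):
    """Root mean square deviation between two matched lists of (x, y, z) atoms.

    Assumption: atoms are paired by index and both poses share one frame;
    no alignment or symmetry handling is performed.
    """
    if not coords_pred or len(coords_pred) != len(coords_ref):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = 0.0
    for atom_pred, atom_ref in zip(coords_pred, coords_ref):
        # Sum squared differences over the x, y, z components of one atom pair.
        squared += sum((p - r) ** 2 for p, r in zip(atom_pred, atom_ref))
    return math.sqrt(squared / len(coords_pred))


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
    shifted = [(x + 1.0, y, z) for x, y, z in reference]
    print(ligand_rmsd(shifted, reference))
```

A rigid shift of every atom by 1 unit along x gives an RMSD of exactly 1, which is a quick sanity check on the averaging.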
{"title":"Systematic evaluation of computational tools to predict the effects of mutations on protein-ligand binding affinity in the absence of experimental structures.","authors":"Qisheng Pan, Stephanie Portelli, Thanh Binh Nguyen, David B Ascher","doi":"10.1093/bib/bbag035","journal":{"name":"Briefings in bioinformatics","volume":"27 1"},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12874888/pdf/"}
Siyong Yu, Jia Wen, Yang Gao, Zhaoqing Yang, Xu Wang, Yan Lu, Jiayou Chu, Dilinuer Maimaitiyiming, Shuhua Xu
The Yugur and Uyghur people of northwestern China share documented Early Medieval origins, yet the evolutionary processes that shaped their present-day genomes remain unresolved. Here, we generate high-coverage whole-genome sequences for the Yugur and compare them with Uyghur genomes to reconstruct their demographic histories, ancestry profiles, and adaptive trajectories. Both groups derive from mixtures of East Eurasian ancestry (EEA) and West Eurasian ancestry (WEA) but in sharply contrasting proportions: the Yugur retain predominantly EEA (~90%), whereas the Uyghur harbor a near-equal balance. Modeling reveals distinct episodes of admixture in Gansu and Xinjiang, with identity-by-descent patterns indicating persistent but substantially reduced genetic continuity (FST = 0.021). Strikingly, despite their EEA-rich background, the Yugur show WEA-shifted allele frequencies at craniofacial loci, including EDAR and LIMS1, suggesting subtle trait convergence. Signals of recent positive selection further differentiate the two populations: the Yugur display strong selection on the FADS locus linked to lipid metabolism, whereas both groups exhibit selection at PPARA but with greater intensity in the Uyghur, consistent with their higher WEA. Functional enrichment analyses highlight overlapping immune and metabolic pathways, consistent with shared biological patterns shaped by demographic history and long-term residence in northwestern China. Together, these findings show how divergent admixture proportions and region-specific natural selection have produced distinct genomic architectures in two historically related populations along the Silk Road.
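For readers unfamiliar with the FST statistic quoted above, the following is a sketch of its textbook heterozygosity-based definition for a single biallelic locus. The abstract does not state which estimator the authors used, so the formula and the allele frequencies in the worked example are illustrative assumptions, not values from the study.

```latex
% Wright's fixation index for two populations at one biallelic locus.
% p_1, p_2: allele frequencies in each population (hypothetical values below).
\[
  F_{ST} = \frac{H_T - H_S}{H_T},
  \qquad
  H_S = \tfrac{1}{2}\bigl[\,2p_1(1-p_1) + 2p_2(1-p_2)\,\bigr],
  \qquad
  H_T = 2\bar{p}\,(1-\bar{p}),
  \quad
  \bar{p} = \tfrac{1}{2}(p_1 + p_2).
\]
% Worked example with p_1 = 0.4 and p_2 = 0.6:
\[
  H_S = \tfrac{1}{2}(0.48 + 0.48) = 0.48,
  \qquad
  H_T = 2(0.5)(0.5) = 0.50,
  \qquad
  F_{ST} = \frac{0.50 - 0.48}{0.50} = 0.04 .
\]
```

Under this definition, values near 0 mean the two populations have nearly identical allele frequencies, so a genome-wide value of about 0.02 indicates modest differentiation.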
{"title":"Divergent Eurasian ancestry and local adaptation shape the genetic landscapes of the Yugur and Uyghur.","authors":"Siyong Yu, Jia Wen, Yang Gao, Zhaoqing Yang, Xu Wang, Yan Lu, Jiayou Chu, Dilinuer Maimaitiyiming, Shuhua Xu","doi":"10.1093/bib/bbag041","journal":{"name":"Briefings in bioinformatics","volume":"27 1"},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12885099/pdf/"}
Fragment-based molecular generation has emerged as a promising paradigm in structure-based drug design (SBDD), yielding effective compounds with desirable properties such as chemical validity, synthetic feasibility, and pharmacological relevance. However, existing approaches often struggle to generate molecules that both conform to 3D structural constraints and remain chemically plausible. This is largely because prior work treats the scaffolds and R-groups of molecules indiscriminately, overlooking their distinct semantic roles. Specifically, the scaffold serves as the rigid structural backbone that determines the global geometric topology and binding pose, whereas R-groups act as functional substituents responsible for fine-tuning local physicochemical interactions. We therefore propose fragment-based dual conditional diffusion (FDC-Diff), a novel framework that integrates chemical priors and structural cues for fragment-based molecular generation. Unlike traditional de novo methods that generate atoms sequentially, FDC-Diff decomposes molecule generation into two semantically complementary stages. Given the protein pocket and an initial fragment, the first stage constructs a spatially constrained scaffold to capture the global molecular topology. The second stage elaborates R-groups onto the obtained scaffold to capture local semantics and further refine molecular properties. To ensure synthetic accessibility, the initial fragments and the scaffold-modification hierarchy are derived from curated reaction rules, and a physical-chemistry-inspired refinement step is applied to optimize final conformations. Experimental results on multiple SBDD benchmarks demonstrate that FDC-Diff achieves state-of-the-art performance across comprehensive evaluations. Furthermore, our model excels at producing chemically valid, spatially compatible, and pharmacologically relevant molecules, suggesting its potential as a feasible tool for fragment-based drug design.
{"title":"An effective fragment-based dual conditional diffusion framework for molecular generation.","authors":"Haotian Chen, Yiting Shen, Jichun Li, Weizhong Zhao","doi":"10.1093/bib/bbaf727","journal":{"name":"Briefings in bioinformatics","volume":"27 1"},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12814976/pdf/"}
Lena Maria Hackl, Fabian Neuhaus, Sabine Ameling, Uwe Völker, Jan Baumbach, Olga Tsoy
Alternative splicing is a crucial mechanism of gene regulation that enables condition- and tissue-specific expression of gene isoforms. Its dysregulation plays a role in various diseases such as cancer, neurological disorders, and metabolic conditions. Despite its importance, accurate detection of alternative splicing events remains challenging. Comprehensive alternative splicing event detection typically requires deep sequencing with over 100 million reads; however, much of the publicly accessible RNA sequencing data is of lower sequencing depth. Recent advances, particularly deep learning models working with genomic sequences, offer new avenues for predicting alternative splicing without reliance on high sequencing depth data. Our study addresses the question: Can we utilize the vast repository of publicly available RNA sequencing data for comprehensive alternative splicing detection, despite the low sequencing depth? Our results demonstrate the potential of sequence-based deep learning tools such as AlphaGenome, SpliceAI and DeepSplice for initial hypothesis development and as additional filters in standard RNA sequencing pipelines, especially when sequencing depth is limited. Nonetheless, validation with higher sequencing depths remains essential for confirmation of splice events. Overall, our findings underscore the need for integrative methods combining genomic sequence data and RNA sequencing data for the prediction of tissue- and condition-specific alternative splicing in resource-limited settings.
{"title":"Detection of alternative splicing: deep sequencing or deep learning?","authors":"Lena Maria Hackl, Fabian Neuhaus, Sabine Ameling, Uwe Völker, Jan Baumbach, Olga Tsoy","doi":"10.1093/bib/bbaf705","journal":{"name":"Briefings in bioinformatics","volume":"27 1"},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12790623/pdf/"}
Antimicrobial resistance poses a significant challenge to conventional antibiotics, underscoring the urgent need for alternative therapeutic strategies. Antimicrobial peptides (AMPs) have emerged as promising candidates due to their broad-spectrum antibacterial activity and distinct mechanisms of action. This study presents ANIA, a deep learning framework developed to predict the minimum inhibitory concentration (MIC) values of AMPs against three clinically significant bacteria: Staphylococcus aureus, Escherichia coli, and Pseudomonas aeruginosa. ANIA leverages Chaos Game Representation (CGR) to transform AMP sequences into frequency-based image features, which are subsequently processed through a hybrid architecture comprising stacked Inception modules, a Transformer encoder, and a regression head. This integrative architecture enables ANIA to capture both local motif-based features and global contextual patterns embedded within AMP sequences. In benchmarking experiments, ANIA outperformed existing tools, including ESKAPEE-Pred, AMPActiPred, and esAMPMIC, with higher correlation coefficients and lower prediction errors across all bacterial targets; the most pronounced improvement was observed for P. aeruginosa, a pathogen renowned for its multidrug resistance. Specifically, ANIA achieved Pearson correlation coefficients (PCCs) of 0.75-0.79 and mean squared errors (MSEs) of 0.23-0.26 across all species. Furthermore, motif-based interpretability analyses combining Grad-CAM visualizations, correlation heatmaps, motif frequency distributions, and hydrophobicity profiling revealed biologically meaningful subregions within the CGR matrix that are plausibly associated with antimicrobial efficacy. In conclusion, ANIA is a robust predictive tool for MIC estimation, offering valuable insights into the design of effective antimicrobial agents and contributing to the fight against antimicrobial resistance. A user-friendly web server for ANIA is available at https://biomics.lab.nycu.edu.tw/ANIA/.
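The two evaluation metrics reported above, the Pearson correlation coefficient and the mean squared error, can be computed as in the sketch below. This is a generic illustration rather than the authors' evaluation script: the function names are hypothetical, and the abstract does not say whether MIC values were log-transformed before scoring, so the inputs are treated as plain numbers.

```python
import math


def pearson_cc(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    # Covariance and variances (unnormalised; the 1/n factors cancel).
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    if var_true == 0 or var_pred == 0:
        raise ValueError("PCC is undefined when either input is constant")
    return cov / math.sqrt(var_true * var_pred)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between measured and predicted values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and equal in length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    measured = [1.0, 2.0, 3.0, 4.0]    # hypothetical MIC targets
    predicted = [2.0, 4.0, 6.0, 8.0]   # hypothetical model outputs
    print(pearson_cc(measured, predicted), mean_squared_error(measured, predicted))
```

The example inputs are perfectly correlated yet far from the targets, which shows why the two metrics are reported together: PCC measures whether predictions rank and scale with the truth, while MSE measures how close they are in absolute terms.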
{"title":"ANIA: an inception-attention network for predicting minimum inhibitory concentration of antimicrobial peptides.","authors":"Yen-Peng Chiu, Lantian Yao, Yun Tang, Chia-Ru Chung, Yuxuan Pang, Ying-Chih Chiang, Tzong-Yi Lee","doi":"10.1093/bib/bbag023","journal":{"name":"Briefings in bioinformatics","volume":"27 1"},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12895073/pdf/"}
Spatial domain identification is an essential task for revealing spatial heterogeneity within tissues, providing insights into disease mechanisms, tissue development, and the cellular microenvironment. In recent years, spatial multi-omics has emerged as a new frontier in spatial domain identification, offering deeper insights into the complex interplay and functional dynamics of heterogeneous cell communities within their native tissue context. Most existing methods rely on static graph structures that treat all neighboring cells uniformly, failing to capture the nuanced cellular interactions within the microenvironment and thus blurring functional boundaries. Furthermore, cross-modal reconstruction performance is often degraded by overfitting to modality-specific noise, which may impair the precise delineation of spatial domains. We therefore present GATCL, a novel deep learning framework that integrates a graph attention network with contrastive learning (CL) for robust spatial domain identification. First, GATCL leverages the graph attention mechanism to dynamically assign weights to neighboring spots, adaptively modeling the complex cellular architecture. Second, it implements a cross-modal CL strategy that forces representations from the same spatial location to be similar while pushing those from different locations apart, thereby achieving robust alignment between modalities. Comprehensive experiments across six distinct datasets (spanning transcriptome, proteome, and chromatin) show that GATCL outperforms seven representative methods across six key evaluation metrics.
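The cross-modal contrastive idea described above, pulling together the two modality embeddings of the same spot and pushing apart those of different spots, can be sketched with a generic InfoNCE-style loss. This is an assumption-laden illustration, not GATCL's actual objective: the function names, the use of cosine similarity, and the `temperature` value are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_modal_infonce(z_a, z_b, temperature=0.5):
    """InfoNCE-style loss; z_a[i] and z_b[i] embed the same spot in two modalities.

    For each spot i, the matching pair (i, i) is the positive and every
    (i, j != i) pair is a negative. Lower loss means better alignment.
    """
    n = len(z_a)
    if n == 0 or n != len(z_b):
        raise ValueError("embedding lists must be non-empty and equal in length")
    total = 0.0
    for i in range(n):
        logits = [cosine(z_a[i], z_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidate spots.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n


if __name__ == "__main__":
    rna = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]       # hypothetical modality A
    protein_aligned = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    protein_shuffled = [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]]
    print(cross_modal_infonce(rna, protein_aligned))
    print(cross_modal_infonce(rna, protein_shuffled))
```

In the usage example, the aligned embeddings yield a lower loss than the shuffled ones, which is the behavior a training procedure would exploit when minimizing this objective.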
{"title":"GATCL: graph attention network meets contrastive learning for spatial domain identification.","authors":"Jichong Mu, Yachen Yao, Qiuhao Chen, Jiqiu Sun, Tianyi Zhao","doi":"10.1093/bib/bbag043","journal":{"name":"Briefings in bioinformatics","volume":"27 1"},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12900075/pdf/"}
To understand disease mechanisms and advance therapies, accurately predicting how missense mutations alter protein-DNA binding affinity is critical. Many existing models neglect the unique characteristics of missense mutations in both double-stranded DNA-binding proteins (DSBs) and single-stranded DNA-binding proteins (SSBs). To address this issue, we constructed a comprehensive dataset from diverse sources. By leveraging sequence-based embeddings from pretrained protein language models including ESM2, ProtTrans, and ESM1v, we developed iDLDDG, a deep learning framework that integrates multi-scale structural and evolutionary information via a multi-channel architecture. To balance residue-wise information density against entropy, our entropy-based algorithm determined 181 residues as optimal for modeling biophysical constraints. This approach enhances predictive accuracy and computational efficiency, thereby supporting large-scale assessments of mutation effects in DNA-binding proteins. iDLDDG achieves state-of-the-art performance, with a 10-fold cross-validation PCC of 0.755 on MPD276 and 0.632 on independent test sets encompassing both DSBs and SSBs, significantly surpassing existing methods. By establishing the first computational framework that rigorously differentiates DSB and SSB mutation mechanisms, our work provides a foundation for high-accuracy prediction of pathological mutations in DNA-binding proteins.
Xuan Yu, Fang Ge, Dong-Jun Yu, Zhaohong Deng. "iDLDDG: predicting protein stability changes from missense mutations in DNA-binding proteins using integrated deep learning features." Briefings in bioinformatics, vol. 27, no. 1. Journal Article, published 2026-01-07. DOI: 10.1093/bib/bbag050. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12903960/pdf/
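The iDLDDG abstract above reports performance as a 10-fold cross-validation PCC (Pearson correlation coefficient). The following is a minimal sketch of how such a score can be computed, assuming out-of-fold predictions are pooled before scoring; the abstract does not say whether the authors pool predictions or average per-fold values. The least-squares model, the synthetic data, and the function names `pearson_cc` and `cross_validated_pcc` are illustrative placeholders, not part of iDLDDG.

```python
import numpy as np

def pearson_cc(y_true, y_pred):
    """Pearson correlation coefficient between two 1-D arrays."""
    yt = np.asarray(y_true, dtype=float)
    yp = np.asarray(y_pred, dtype=float)
    yt = yt - yt.mean()
    yp = yp - yp.mean()
    return float((yt @ yp) / np.sqrt((yt @ yt) * (yp @ yp)))

def cross_validated_pcc(X, y, n_folds=10, seed=0):
    """Pool out-of-fold predictions over n_folds and score them with PCC."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    oof = np.empty(len(y), dtype=float)
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        # Placeholder model (assumption): ordinary least squares with intercept.
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_test = np.column_stack([X[test_idx], np.ones(len(test_idx))])
        oof[test_idx] = A_test @ coef
    return pearson_cc(y, oof)

# Synthetic stand-in for (features, measured change) pairs; not real mutation data.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 8))
y = X @ np.linspace(-1.0, 1.0, 8) + rng.normal(scale=1.0, size=300)
cv_pcc = cross_validated_pcc(X, y)
print(f"10-fold CV PCC on synthetic data: {cv_pcc:.3f}")
```

Pooling gives a single correlation over all held-out samples; averaging per-fold PCC values is an equally common convention and can give a slightly different number on small datasets.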
Muhammad Faizan-Khan, Roger Giné, Josep M Badia, Maribel Pérez-Ribera, Manuel Ruiz-Botella, Alexandra Junza, Jordi Capellades, Iván Pérez-López, Shipei Xing, Abubaker Patan, Laura Brugnara, Anna Novials, Joan-Marc Servitja, Maria Vinaixa, Pieter C Dorrestein, Marta Sales-Pardo, Roger Guimerà, Oscar Yanes
Machine learning offers a promising path to annotating the large number of unidentified MS/MS spectra in metabolomics, addressing the limited coverage of current reference spectral libraries. However, existing methods often struggle with the high dimensionality and sparsity of MS/MS spectra and metabolite structures. ChemEmbed tackles these challenges by integrating multidimensional, continuous vector representations of chemical structures with enhanced MS/MS spectra. This enhancement is achieved by merging spectra across multiple collision energies and incorporating calculated neutral losses from 38 472 distinct compounds, providing richer input for a convolutional neural network (CNN). ChemEmbed ranks the correct candidate first in over 42% of cases and within the top five in more than 76% of cases. In external benchmarks such as CASMI 2016 and 2022, ChemEmbed outperforms SIRIUS 6, the current state-of-the-art in computational metabolomics. We applied ChemEmbed to predict structures in the Annotated Recurrent Unidentified Spectra (ARUS) dataset and confirmed 25 previously unidentified compounds. These findings demonstrate ChemEmbed's potential as a robust, scalable tool for accelerating metabolite identification in untargeted mass spectrometry workflows.
"ChemEmbed: a deep learning framework for metabolite identification using enhanced MS/MS data and multidimensional molecular embeddings." Briefings in bioinformatics, vol. 27, no. 1. Journal Article, published 2026-01-07. DOI: 10.1093/bib/bbag054. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12903953/pdf/
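The ChemEmbed abstract above describes enhancing MS/MS spectra by merging them across collision energies and adding calculated neutral losses, and reports top-1 and top-5 ranking rates. The sketch below illustrates those three ideas in plain Python. The m/z tolerance, the intensity-weighted merging rule, the normalization, and all function names are assumptions made for illustration; the abstract does not specify how ChemEmbed implements these steps.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)

def merge_spectra(spectra: List[List[Peak]], tol: float = 0.01) -> List[Peak]:
    """Merge peaks from several collision energies into one spectrum.

    Peaks closer than `tol` in m/z are combined: intensities are summed and
    the m/z is intensity-weighted. Intensities are then scaled to a max of 1.
    """
    peaks = sorted(p for spec in spectra for p in spec if p[1] > 0)
    merged: List[Peak] = []
    for mz, inten in peaks:
        if merged and mz - merged[-1][0] <= tol:
            prev_mz, prev_int = merged[-1]
            total = prev_int + inten
            merged[-1] = ((prev_mz * prev_int + mz * inten) / total, total)
        else:
            merged.append((mz, inten))
    if not merged:
        return []
    top = max(inten for _, inten in merged)
    return [(mz, inten / top) for mz, inten in merged]

def neutral_losses(peaks: List[Peak], precursor_mz: float) -> List[Peak]:
    """Neutral loss of each fragment: precursor m/z minus fragment m/z."""
    return [(precursor_mz - mz, inten) for mz, inten in peaks
            if precursor_mz - mz > 0]

def top_k_accuracy(ranks: List[int], k: int) -> float:
    """Fraction of queries whose correct candidate is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical spectra of one compound at two collision energies.
low_energy = [(85.03, 20.0), (127.04, 100.0)]
high_energy = [(57.03, 60.0), (85.032, 80.0)]
merged = merge_spectra([low_energy, high_energy])
losses = neutral_losses(merged, precursor_mz=145.05)
print(merged)
print(losses)
print(top_k_accuracy([1, 3, 7, 2], k=5))
```

In this toy input the two peaks near m/z 85.03 fall within the tolerance and collapse into one, so the merged spectrum has three peaks, each paired with a neutral loss relative to the assumed precursor.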
Yu Zhang, Ming Li, David M Haas, C Noel Bairey Merz, Tsegaselassie Workalemahu, Kelli Ryckman, Janet M Catov, Lisa D Levine, Alexa Freedman, George R Saade, Jiaqi Hu, Hongyu Zhao, Xihao Li, Nianjun Liu, Qi Yan
Mendelian randomization (MR) has become an important technique for establishing causal relationships between risk factors and health outcomes. By using genetic variants as instrumental variables, it can mitigate bias due to confounding and reverse causation in observational studies. Current MR analyses have predominantly used common genetic variants as instruments, which represent only part of the genetic architecture of complex traits. Rare variants, which can have larger effect sizes and provide unique biological insights, have been understudied due to statistical and methodological challenges. We introduce MR-common and annotation-informed rare variants (MR-CARV), a novel framework integrating common and rare genetic variants in two-sample MR. This method leverages comprehensive genetic data made available by high-throughput sequencing technologies and large-scale consortia. Rare variants are aggregated into functional categories, such as gene-coding, gene-noncoding, and nongene regions, by leveraging variant annotations and biological impact as weights. The effects of rare variant sets are then estimated with STAARpipeline and combined with the estimated effects of common variants using existing MR methods. Simulation studies demonstrate that MR-CARV maintains robust type I error and achieves higher statistical power, with up to a 66.3% relative increase compared with existing methods based only on common variants. Consistent with these findings, application to real data on high-density lipoprotein cholesterol (HDL-C) and preeclampsia showed that MR-CARV [inverse variance weighted (IVW)] yielded a more precise and statistically significant effect estimate (-0.020, SE = 0.0102, $P = 0.0470$) than IVW using only common variants (-0.023, SE = 0.0123, $P = 0.0659$).
"A novel two-sample Mendelian randomization framework integrating common and rare variants: application to assess the effect of HDL-C on preeclampsia risk." Briefings in bioinformatics, vol. 27, no. 1. Journal Article, published 2026-01-07. DOI: 10.1093/bib/bbaf649. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12777983/pdf/
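The MR-CARV abstract above combines rare-variant-set effects with common-variant effects through existing MR estimators such as inverse variance weighted (IVW). The sketch below shows the standard fixed-effect IVW estimator on summary statistics and, as an assumption about how the combination might look, treats each rare-variant set as one extra instrument. All numbers are invented for illustration, and `ivw_estimate` is a hypothetical helper, not the authors' code or the STAARpipeline API.

```python
import math
from typing import Sequence, Tuple

def ivw_estimate(beta_x: Sequence[float],
                 beta_y: Sequence[float],
                 se_y: Sequence[float]) -> Tuple[float, float, float]:
    """Fixed-effect IVW estimate from per-instrument summary statistics.

    beta_x: instrument-exposure effects (must be nonzero)
    beta_y: instrument-outcome effects
    se_y:   standard errors of beta_y
    Returns (causal estimate, standard error, two-sided normal p-value).
    """
    # First-order weights: inverse variance of each Wald ratio beta_y / beta_x.
    weights = [bx ** 2 / sy ** 2 for bx, sy in zip(beta_x, se_y)]
    ratios = [by / bx for bx, by in zip(beta_x, beta_y)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    se = math.sqrt(1.0 / total)
    p_value = math.erfc(abs(estimate / se) / math.sqrt(2.0))
    return estimate, se, p_value

# Invented summary statistics for three common variants.
common = dict(beta_x=[0.10, 0.08, 0.12],
              beta_y=[-0.0020, -0.0018, -0.0030],
              se_y=[0.0020, 0.0020, 0.0025])
# Invented aggregated effects for two rare-variant sets (assumed to act as instruments).
rare = dict(beta_x=[0.30, 0.25],
            beta_y=[-0.0070, -0.0040],
            se_y=[0.0040, 0.0040])

est_common, se_common, p_common = ivw_estimate(**common)
est_all, se_all, p_all = ivw_estimate(
    common["beta_x"] + rare["beta_x"],
    common["beta_y"] + rare["beta_y"],
    common["se_y"] + rare["se_y"],
)
print(f"common only:   {est_common:.4f} (SE {se_common:.4f}, P {p_common:.4f})")
print(f"common + rare: {est_all:.4f} (SE {se_all:.4f}, P {p_all:.4f})")
```

Because every added instrument contributes a positive weight, the fixed-effect IVW standard error can only shrink as instruments are added, which matches the direction of the precision gain reported in the abstract; whether that gain is valid in practice depends on the added instruments satisfying the usual MR assumptions.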