Genome-wide association studies (GWASs) have been conducted primarily in European (EUR) populations, limiting insights into underrepresented groups such as East Asian (EAS) populations; however, cross-ancestry GWASs have demonstrated high trans-ethnic genetic similarity between EUR and non-EUR populations. To enhance association analysis power in EAS populations, we propose tranScore, a novel summary-statistics-based transfer learning method that leverages trans-ethnic genetic similarity through hierarchical modeling. Treating EUR as the auxiliary population, tranScore jointly tests genetic effects in the auxiliary and target populations via well-established P-value combination procedures. Simulations demonstrate that tranScore controls type I error rates and provides substantial power gains across diverse genetic architectures, and that it is robust to challenges including incomplete SNP overlap and effect heterogeneity. In a real-data application to eight diseases from the China Kadoorie Biobank (CKB), tranScore, after incorporating genetic information from the EUR population, identified significantly more genes than the traditional score test, which ignores such information. Approximately 41.9% of the discovered genes were replicated in the Biobank Japan cohort. Overall, tranScore is a flexible and powerful statistical approach for association analysis of complex diseases and traits through transfer learning of shared genetic similarities between the auxiliary and target populations.
{"title":"An integrative association analysis for complex diseases in underrepresented groups by leveraging the trans-ethnic genetic similarity.","authors":"Shuo Zhang, Jike Qi, Yuchen Jiang, Hua Lin, Xinyi Wang, Ting Wang, Hongyan Cao, Ping Zeng","doi":"10.1093/bib/bbag103","journal":{"name":"Briefings in bioinformatics","volume":"27 2","pages":""},"publicationDate":"2026-03-01","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12971055/pdf/"}
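The tranScore abstract above says that evidence from the auxiliary and target populations is combined through P-value combination procedures, without naming them. As a minimal sketch, assuming two commonly used procedures (Fisher's method and the Cauchy combination test) stand in for whatever the paper actually uses, the combination step could look like the following. The function names and the example P-values are hypothetical.

```python
import math


def fisher_combine(p_values):
    """Fisher's method: -2 * sum(log p) follows a chi-square with 2k degrees of freedom."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # test statistic divided by 2
    # Closed-form chi-square survival function for an even number of degrees of freedom.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def cauchy_combine(p_values, weights=None):
    """Cauchy combination: weighted mean of tan((0.5 - p) * pi), mapped back to a P-value."""
    k = len(p_values)
    if weights is None:
        weights = [1.0 / k] * k  # equal weights that sum to one
    stat = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))
    return 0.5 - math.atan(stat) / math.pi


# Hypothetical gene-level P-values: auxiliary (EUR) and target (EAS) populations.
p_auxiliary, p_target = 1e-4, 0.02
combined_fisher = fisher_combine([p_auxiliary, p_target])
combined_cauchy = cauchy_combine([p_auxiliary, p_target])
```

Fisher's method assumes the input P-values are independent, whereas the Cauchy combination is known to tolerate dependence between them; which assumption suits a given pair of cohorts is a judgment the sketch does not make.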
Drug repurposing provides a cost-effective and time-efficient strategy to accelerate therapeutic discovery, yet most computational approaches fail to capture the multi-scale biomedical mechanisms underlying drug-disease associations, limiting interpretability. We introduce BioMNEDR (mechanism-guided network embedding for drug repurposing), a framework that integrates heterogeneous biomedical networks through biologically curated meta-paths. BioMNEDR generates low-dimensional embeddings that preserve protein-protein interactions and functional hierarchies, and it integrates multi-path predictions through an XGBoost classifier. The framework achieves state-of-the-art performance, consistently surpassing strong baselines across AUROC, AUPR, recall, and F1-score, while maintaining a balanced trade-off in precision. Case studies further highlight its practical utility, demonstrating the ability to rediscover approved drugs and prioritize promising candidates, such as cromoglicic acid for Alzheimer's disease. By explicitly modeling multi-scale mechanisms, BioMNEDR enhances both predictive accuracy and biomedical interpretability, offering a robust computational framework for systematic drug repurposing.
{"title":"BioMNEDR: mechanism-guided network embedding for drug repurposing.","authors":"Yizhou Zeng, Lei Wang, Xueming Liu","doi":"10.1093/bib/bbag101","journal":{"name":"Briefings in bioinformatics","volume":"27 2","pages":""},"publicationDate":"2026-03-01","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12971018/pdf/"}
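To make the idea of a meta-path in the BioMNEDR abstract above concrete, the sketch below counts path instances along two node-type sequences in a toy heterogeneous network: drug-protein-disease (DPD) and drug-protein-protein-disease (DPPD). This is a simplified stand-in, not the paper's method: BioMNEDR learns embeddings along curated meta-paths, whereas this example only counts paths. All node names and edges are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy network; none of these edges come from the paper's data.
drug_targets = {"drugA": {"P1", "P2"}, "drugB": {"P3"}}
ppi = {("P1", "P3"), ("P2", "P3")}
disease_proteins = {"diseaseX": {"P3"}, "diseaseY": {"P1"}}


def neighbors(undirected_edges):
    """Build an adjacency map from a set of undirected edges."""
    adjacency = defaultdict(set)
    for u, v in undirected_edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def metapath_counts(drug, disease):
    """Count DPD and DPPD path instances between one drug and one disease."""
    ppi_adjacency = neighbors(ppi)
    targets = drug_targets.get(drug, set())
    associated = disease_proteins.get(disease, set())
    dpd = len(targets & associated)
    dppd = sum(len(ppi_adjacency[t] & associated) for t in targets)
    return {"DPD": dpd, "DPPD": dppd}


features = metapath_counts("drugA", "diseaseX")
```

Per-meta-path scores of this kind are the sort of input that a downstream classifier, such as the XGBoost model mentioned in the abstract, could combine into a single drug-disease prediction.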
Jingzhan Lu, Johan H Thygesen, Robin N Beaumont, Michael N Weedon, Harry D Green
As genome-wide association studies (GWAS) move from array-based genotyping to whole-exome and whole-genome sequencing, costs increase significantly. In large studies where more potential controls are available than needed, selecting an appropriate subset of controls may minimize resource use whilst maintaining statistical power. We evaluated three control selection strategies in a prostate cancer GWAS using 15 250 UK Biobank cases: (a) all controls, (b) matched controls, and (c) random selection. Both (b) and (c) achieved comparable power in detecting significant loci relative to (a), but matched controls (b) showed greater consistency in identifying lead single nucleotide polymorphisms (SNPs). However, using matched controls (b) reduced discovery power by ~30% compared with all controls (a), highlighting a trade-off. Matching controls (1:4 ratio) offers a cost-effective approach for targeted SNP analysis across phenotypes but may miss novel associations.
{"title":"Impact of control selection strategies on GWAS results: a study of prostate cancer in the UK Biobank.","authors":"Jingzhan Lu, Johan H Thygesen, Robin N Beaumont, Michael N Weedon, Harry D Green","doi":"10.1093/bib/bbag102","journal":{"name":"Briefings in bioinformatics","volume":"27 2","pages":""},"publicationDate":"2026-03-01","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12971001/pdf/"}
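The 1:4 matched-control strategy in the abstract above can be illustrated with a small greedy matcher. The abstract does not state which covariates were matched on, so the sketch assumes two hypothetical ones, age and recruitment centre, and picks the nearest-age unused controls from the same centre for each case.

```python
def match_controls(cases, controls, ratio=4):
    """Greedy matching without replacement: same centre, nearest age, `ratio` controls per case."""
    available = list(controls)
    matched = {}
    for case in cases:
        candidates = [c for c in available if c["centre"] == case["centre"]]
        # Sort by age distance; the id is a deterministic tie-breaker.
        candidates.sort(key=lambda c: (abs(c["age"] - case["age"]), c["id"]))
        chosen_ids = [c["id"] for c in candidates[:ratio]]
        matched[case["id"]] = chosen_ids
        taken = set(chosen_ids)
        available = [c for c in available if c["id"] not in taken]
    return matched


# Hypothetical records; the covariates are assumptions for illustration only.
cases = [
    {"id": "case1", "age": 60, "centre": "A"},
    {"id": "case2", "age": 61, "centre": "A"},
]
controls = [
    {"id": "c1", "age": 58, "centre": "A"},
    {"id": "c2", "age": 61, "centre": "A"},
    {"id": "c3", "age": 60, "centre": "A"},
    {"id": "c4", "age": 70, "centre": "A"},
    {"id": "c5", "age": 59, "centre": "A"},
    {"id": "c6", "age": 62, "centre": "A"},
    {"id": "c7", "age": 60, "centre": "B"},
]
matches = match_controls(cases, controls)
```

Greedy matching depends on the order in which cases are processed, and a case can end up with fewer than four controls when the pool runs short, as happens for the second case here. Optimal matching algorithms avoid the order dependence at higher computational cost.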
Daohong Gong, Xiaowei Xie, Jianxin Tang, Shiliang Li, Honglin Li
RNA-based technologies have demonstrated significant potential for diverse applications, ranging from vaccination to gene editing. However, their widespread adoption is limited by the critical challenge of efficient delivery. Lipid nanoparticles (LNPs) have emerged as a widely utilized RNA delivery system, yet their formulation design and optimization primarily rely on empirical trial-and-error, which is labor-intensive, time-consuming, and cost-prohibitive, thus hindering the rapid development of RNA therapeutics. To facilitate the early-stage design and optimization of LNPs for enhanced delivery efficiency, in this study, we construct LNPs-TE, a benchmark dataset comprising over 10 000 experimentally measured transfection efficiency (TE) values, and introduce LNPs integrated feature fusion Transformer (LIFT), a deep learning framework for LNPs TE prediction. Comprehensive experiments demonstrate that LIFT effectively integrates multidimensional molecular representations of ionizable lipids, the key component in LNPs formulation, achieving superior predictive performance, with an average Pearson correlation coefficient of 0.845 for regression and an area under the receiver operating characteristic curve (AUC-ROC) of 0.818 for multi-class classification across multiple datasets. Through scaffold-based splitting and activity cliff tasks, we further validated the exceptional generalization ability and robustness of LIFT, which achieved over a 10% improvement in the coefficient of determination (R2) compared with state-of-the-art baseline models, highlighting its potential as a practical and stable approach for the virtual screening of efficient LNPs formulation. The relevant data, model and code are made publicly available at https://github.com/U12458/LIFT.
{"title":"Transformer-based multidimensional feature fusion for accurate prediction of lipid nanoparticles transfection efficiency.","authors":"Daohong Gong, Xiaowei Xie, Jianxin Tang, Shiliang Li, Honglin Li","doi":"10.1093/bib/bbag092","journal":{"name":"Briefings in bioinformatics","volume":"27 2","pages":""},"publicationDate":"2026-03-01","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12951077/pdf/"}
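The scaffold-based splitting mentioned in the LIFT abstract above tests generalization by keeping every molecular scaffold entirely in either the training or the test set. A minimal sketch follows, assuming that a scaffold label has already been computed for each ionizable lipid by a cheminformatics toolkit; the labels and the `scaffold_split` helper are hypothetical and are not taken from the LIFT repository.

```python
from collections import defaultdict


def scaffold_split(scaffolds, train_fraction=0.8):
    """Split sample indices so that no scaffold appears in both train and test.

    `scaffolds` holds one precomputed scaffold label per sample.
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)
    # Largest scaffold groups are placed first; ties are broken by label for determinism.
    ordered = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))
    train_target = train_fraction * len(scaffolds)
    train, test = [], []
    for _, members in ordered:
        if len(train) + len(members) <= train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Hypothetical scaffold labels for ten lipids.
labels = ["s1"] * 5 + ["s2"] * 3 + ["s3"] * 2
train_idx, test_idx = scaffold_split(labels, train_fraction=0.75)
```

Because whole groups are assigned at once, the realized training fraction can fall short of the requested one. That is the price of guaranteeing that test scaffolds are unseen during training.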
Sudipto Baul, Naima Ahmed Fahmi, Guangyu Wang, Hao Zheng, Ahmed Louri, Jeongsik Yong, Wei Zhang
Understanding how the 3D structure of the genome influences gene regulation is a growing area of interest, particularly in the context of alternative post-transcriptional regulatory events such as alternative splicing (AS) and alternative polyadenylation (APA). These processes are essential for generating transcript and protein diversity, and they are tightly coordinated with transcription. However, despite their biological importance, the relationship between chromatin interactions and alternative pre-messenger RNA regulation remains poorly understood. This gap largely stems from a lack of computational tools capable of integrating structural genomic data with RNA processing dynamics. Exploring how chromatin interactions and epigenetic landscapes shape these events is essential for uncovering the multilayered regulation of gene expression. To bridge this gap, we present EpGAT, a graph attention network-based model that integrates epigenetic read coverage and chromatin interaction data to predict and quantify AS and APA events. By explicitly modeling the spatial organization of the genome, EpGAT captures the regulatory influence of chromatin looping and long-range genomic interactions on RNA processing. The model's predictions are validated through rigorous cross-cell line and cross-chromosome evaluations, affirming its generalizability and reliability. Beyond prediction, EpGAT offers interpretability by tracing learned parameters back to genomic features, enabling the identification of active enhancers, mapping promoter-enhancer connectivity, and pinpointing the epigenetic factors most critical to specific RNA processing events. These capabilities make EpGAT a powerful tool for dissecting the complex interplay between genome architecture and transcriptomic regulation. More broadly, it provides a generalizable framework for multiple tasks to study the link between 3D genome organization, epigenetic signals, and RNA processing.
{"title":"EpGAT: integrating epigenetics and 3D genome structure to predict alternative splicing and polyadenylation.","authors":"Sudipto Baul, Naima Ahmed Fahmi, Guangyu Wang, Hao Zheng, Ahmed Louri, Jeongsik Yong, Wei Zhang","doi":"10.1093/bib/bbag091","journal":{"name":"Briefings in bioinformatics","volume":"27 2","pages":""},"publicationDate":"2026-03-01","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12951080/pdf/"}
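EpGAT, as described in the abstract above, is built on a graph attention network. The sketch below implements one generic single-head graph attention update in plain Python. It assumes that nodes are genomic bins, edges are chromatin contacts, and node features are epigenetic coverage values. These choices, the toy numbers, and the function names are illustrative assumptions and not EpGAT's actual architecture.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def leaky_relu(value, slope=0.2):
    return value if value > 0 else slope * value


def gat_layer(features, neighbors, weight, attention):
    """One single-head graph attention update.

    features:  {node: feature vector}
    neighbors: {node: list of neighbor nodes, including the node itself}
    weight:    shared linear map W as a list of rows
    attention: vector a of length 2 * len(weight)
    """
    projected = {node: matvec(weight, vec) for node, vec in features.items()}
    updated, alphas = {}, {}
    for node, nbrs in neighbors.items():
        # Unnormalized score for each neighbor: LeakyReLU(a . [W h_i || W h_j]).
        scores = [
            leaky_relu(sum(a * x for a, x in zip(attention, projected[node] + projected[nbr])))
            for nbr in nbrs
        ]
        peak = max(scores)  # subtract the maximum for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        alphas[node] = {nbr: e / total for nbr, e in zip(nbrs, exps)}
        dim = len(projected[node])
        updated[node] = [
            sum(alphas[node][nbr] * projected[nbr][d] for nbr in nbrs) for d in range(dim)
        ]
    return updated, alphas


# Hypothetical bins with two coverage features each, linked by chromatin contacts.
features = {"bin1": [1.0, 0.0], "bin2": [3.0, 2.0], "bin3": [0.0, 4.0]}
contacts = {"bin1": ["bin1", "bin2"], "bin2": ["bin2", "bin1", "bin3"], "bin3": ["bin3", "bin2"]}
identity = [[1.0, 0.0], [0.0, 1.0]]
updated, alphas = gat_layer(features, contacts, identity, [0.5, -0.25, 0.1, 0.3])
```

The attention weights in `alphas` sum to one over each node's neighborhood. Inspecting weights of this kind is one generic way a graph attention model can relate a prediction to specific long-range contacts.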
Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis of gene expression at the individual cell level, with clustering serving as a critical step for identifying distinct cell populations. Due to the high dimensionality and sparsity of scRNA-seq data, existing approaches typically perform gene selection prior to clustering. However, treating feature selection as a separate preprocessing step can overlook latent clustering structure and often results in suboptimal outcomes, as it does not guarantee that the selected genes are informative for clustering. To address this limitation, we propose FSSC (Feature Selection for scRNA-seq Clustering), a unified framework for joint feature selection and clustering in scRNA-seq analysis. FSSC integrates a zero-inflated negative binomial (ZINB) autoencoder with a group Lasso penalty and a dedicated clustering loss. This joint optimization enables the model to simultaneously learn low-dimensional representations and select a compact set of cluster-discriminatory genes, preserving both the statistical characteristics of scRNA-seq data and its underlying cluster structure. Extensive experiments on both simulated and real scRNA-seq datasets demonstrate that FSSC consistently outperforms state-of-the-art methods in clustering accuracy and effectively identifies a compact, biologically meaningful set of marker genes.
{"title":"Integrating feature selection with unsupervised deep embedding for clustering single-cell RNA-seq data.","authors":"Cheng Zhong, Siqi Jiang, Zhi Wei","doi":"10.1093/bib/bbag082","journal":{"name":"Briefings in bioinformatics","volume":"27 2","pages":""},"publicationDate":"2026-03-01","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12951082/pdf/"}
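Two of the loss components named in the FSSC abstract above have standard closed forms: the zero-inflated negative binomial (ZINB) likelihood and the group Lasso penalty. The sketch below writes both out for a single count and a small weight matrix. It assumes the usual mean/dispersion parameterization of the negative binomial and one penalty group per gene. The clustering loss is omitted, and the way FSSC weights the terms is not given in the abstract.

```python
import math


def zinb_nll(count, mean, dispersion, dropout):
    """Negative log-likelihood of one count under a zero-inflated negative binomial.

    Requires mean > 0, dispersion > 0 and 0 <= dropout < 1, where `dropout`
    is the probability of an excess (technical) zero.
    """
    log_nb = (
        math.lgamma(count + dispersion)
        - math.lgamma(dispersion)
        - math.lgamma(count + 1)
        + dispersion * math.log(dispersion / (dispersion + mean))
        + count * math.log(mean / (dispersion + mean))
    )
    if count == 0:
        # A zero arises either from dropout or from the negative binomial itself.
        return -math.log(dropout + (1.0 - dropout) * math.exp(log_nb))
    return -(math.log(1.0 - dropout) + log_nb)


def group_lasso(first_layer_weights):
    """Sum of L2 norms with one group per gene (each row holds the weights leaving that gene)."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in first_layer_weights)


# Hypothetical values: one observed zero and a two-gene weight matrix.
loss_zero = zinb_nll(0, mean=1.0, dispersion=1.0, dropout=0.2)
penalty = group_lasso([[3.0, 4.0], [0.0, 0.0]])
```

Because the penalty acts on whole rows, it drives all weights of an uninformative gene to zero together. This is how a group Lasso performs gene selection rather than selection of individual weights.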
Proteins are essential components of all living organisms and play a critical role in cellular survival. They have a broad range of applications, from clinical treatments to material engineering. This versatility has spurred the development of protein design, with amino acid sequence design being a crucial step in the process. Recent advancements in deep generative models have shown promise for protein sequence design. However, the scarcity of functional protein sequence data for certain types can hinder the training of these models, which often require large datasets. To address this challenge, we propose a hierarchical model named ProteinRG that can generate functional protein sequences using relatively small datasets. ProteinRG begins by generating a representation of a protein sequence, leveraging existing large protein sequence models, before producing a functional protein sequence. We have tested our model on various functional protein sequences and evaluated the results from three perspectives: multiple sequence alignment, t-SNE distribution analysis, and 3D structure prediction. The findings indicate that our generated protein sequences maintain both similarity to the original sequences and consistency with the desired functions. Moreover, our model demonstrates superior performance compared with other generative models for protein sequence generation.
{"title":"De novo functional protein sequence generation: overcoming data scarcity through regeneration and large language models.","authors":"Chenyu Ren, Daihai He, Jian Huang","doi":"10.1093/bib/bbag095","journal":{"name":"Briefings in bioinformatics","volume":"27 2","pages":""},"publicationDate":"2026-03-01","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12967336/pdf/"}
As the fundamental unit of life, cells coordinate biological activities through the interaction between microscopic molecular mechanisms and macroscopic tissue organization. Traditional approaches, including research studies, experiments, and biochemical analyses, yield important insights but are restricted in spatiotemporal resolution and processing power, which precludes the understanding of dynamic cross-scale biological events. Breakthroughs in artificial intelligence (AI) have given rise to the AI virtual cell (AIVC) as a new research paradigm. By integrating multi-omics data and combining methods from multidisciplinary models, AIVC establishes a digital twin system to simulate cell functions and behaviors. However, AIVC still faces a number of pressing challenges at its current stage of development. In this review, we propose a unified definition and technical framework for AIVC and analyze in detail the cross-scale coupling mechanisms of the "gene-protein-pathway-cell" hierarchy. Furthermore, we decompose the technical construction framework of AIVC into cross-scale representation engineering, functional submodule design, and multi-component dynamic regulation mechanisms. Additionally, we summarize the existing models and datasets in the field to provide reference resources for researchers. Finally, we discuss in depth the challenges faced by AIVC, such as data heterogeneity and model interpretability, aiming to accelerate research progress in the AIVC field while driving the life sciences to shift from observational analysis to a paradigm that integrates predictability and innovation. Despite being at an early stage, AIVC is a trending topic that has garnered widespread interest. This review aims to integrate existing models, datasets, and technical ideas to provide a unified framework for the field's development.
{"title":"Artificial intelligence-enabled multi-scale virtual cell: perspective, challenges, and opportunities.","authors":"Huasen Jiang, Xiaoyu Huang, Xiangpeng Bi, Wenjian Ma, Haibo Ni, Zhiqiang Wei, Pin Sun, Henggui Zhang, Shugang Zhang","doi":"10.1093/bib/bbag104","journal":{"name":"Briefings in bioinformatics","volume":"27 2","pages":""},"publicationDate":"2026-03-01","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12967334/pdf/"}
Liquid biopsies, coupled with analysis of copy number alterations (CNAs), have emerged as a promising tool for non-invasive monitoring of cancer progression and tumor composition. However, methods utilizing CNA data from liquid biopsies are limited by the low signal in the samples, which results from the low percentage of cancer DNA in the blood, and by the inherent noise introduced during sequencing. To address this challenge, we developed BayesCNA, a method designed to improve signal extraction from low-pass liquid biopsy sequencing data by utilizing a Bayesian changepoint detection algorithm. We use the posterior changepoint probabilities to identify likely changepoints, where a changepoint indicates a shift in the copy number state. The signal is then reconstructed using the identified partition. We show the effectiveness of the method on synthetically generated datasets and compare the method with state-of-the-art bioinformatics tools under noisy conditions. Our results show that this novel approach increases sensitivity in detecting CNAs, particularly in low-quality cases.
{"title":"Sensitive detection of copy number alterations in low-pass liquid biopsy sequencing data.","authors":"Lotta Eriksson, Eszter Lakatos","doi":"10.1093/bib/bbag111","DOIUrl":"10.1093/bib/bbag111","url":null,"abstract":"<p><p>Liquid biopsies, coupled with analysis of copy number alterations (CNAs), have emerged as a promising tool for non-invasive monitoring of cancer progression and tumor composition. However, methods utilizing CNA data from liquid biopsies are limited by the low signal in the samples, caused by a low percentage of cancer DNA in the blood, and inherent noise introduced in the sequencing. To address this challenge, we developed BayesCNA, a method designed to improve signal extraction from low-pass liquid biopsy sequencing data, by utilizing a Bayesian changepoint detection algorithm. We use information of the posterior changepoint probabilities to identify likely changepoints, where a changepoint indicates a shift in the copy number state. The signal is then reconstructed using the identified partition. We show the effectiveness of the method on synthetically generated datasets and compare the method with state-of-the-art bioinformatics tools under noisy conditions. Our results show that this novel approach increases sensitivity in detecting CNAs, particularly in low-quality cases.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 2","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12991053/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147466959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
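The BayesCNA abstract above describes two steps: computing posterior changepoint probabilities and then reconstructing the signal from the identified partition. The sketch below illustrates that general idea only. It is not the BayesCNA algorithm, whose details the abstract does not give. It assumes a single changepoint, Gaussian noise with known variance `sigma2`, a conjugate N(0, `tau2`) prior on each segment mean, and a uniform prior over split positions. All function and parameter names are hypothetical.

```python
import math


def log_marginal(n, s, q, sigma2, tau2):
    """Log marginal likelihood of one segment with its mean integrated out.

    n: number of points, s: their sum, q: their sum of squares.
    Assumes y_i ~ N(mu, sigma2) with prior mu ~ N(0, tau2).
    """
    return (
        -0.5 * n * math.log(2 * math.pi * sigma2)
        - 0.5 * math.log(1 + n * tau2 / sigma2)
        - q / (2 * sigma2)
        + tau2 * s * s / (2 * sigma2 * (sigma2 + n * tau2))
    )


def changepoint_posterior(y, sigma2=1.0, tau2=10.0):
    """Posterior probability of each split position under a single-changepoint model.

    Entry i of the result is the probability that the signal splits into
    y[:i + 1] and y[i + 1:], given a uniform prior over the n - 1 positions.
    """
    n = len(y)
    if n < 2:
        raise ValueError("need at least two observations")
    # Prefix sums give each segment's sum and sum of squares in O(1).
    ps, pq = [0.0], [0.0]
    for v in y:
        ps.append(ps[-1] + v)
        pq.append(pq[-1] + v * v)
    logw = []
    for k in range(1, n):
        left = log_marginal(k, ps[k], pq[k], sigma2, tau2)
        right = log_marginal(n - k, ps[n] - ps[k], pq[n] - pq[k], sigma2, tau2)
        logw.append(left + right)
    # Normalize in log space to avoid underflow.
    m = max(logw)
    w = [math.exp(v - m) for v in logw]
    total = sum(w)
    return [v / total for v in w]


def reconstruct(y, k):
    """Piecewise-constant reconstruction: replace each segment by its mean."""
    left_mean = sum(y[:k]) / k
    right_mean = sum(y[k:]) / (len(y) - k)
    return [left_mean] * k + [right_mean] * (len(y) - k)


if __name__ == "__main__":
    # Toy log-ratio-like signal with a level shift after the fifth bin.
    signal = [0.1, -0.2, 0.05, 0.0, -0.1, 1.1, 0.9, 1.0, 1.2, 0.95]
    post = changepoint_posterior(signal, sigma2=0.1, tau2=1.0)
    best_k = max(range(len(post)), key=post.__getitem__) + 1
    print(best_k, reconstruct(signal, best_k))
```

A real copy number profile contains many segments, so a practical method would need a multiple-changepoint model; the single-split case is used here only because its posterior can be computed exactly in a few lines.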
Lineage-committed precursors are essential yet rarely identified in mammalian organogenesis, as they lack definitive molecular signatures required for conventional marker-based approaches. Herein, we developed iCommitted, an integrated multi-omics computational pipeline for precise identification of these precursors. iCommitted first reconstructs in vivo organogenesis by modeling the in vitro differentiation trajectory spanning naïve to terminally differentiated cells. It then integrates epigenomic (ATAC-seq/DNase-seq) and transcriptomic (RNA-seq) data to achieve standardized developmental staging and precursor identification. Applied to mammalian hematopoiesis, iCommitted robustly identified hematopoietic progenitors as the hematopoietic lineage-committed precursors. Subsequent cis-regulatory annotation generated a high-confidence atlas of 16 774 hematopoietic cis-regulatory elements. Functional analysis of the atlas further pinpointed a 218-bp hematopoietic enhancer (chr6:145 855 899-145 856 116) that regulates Bhlhe41 expression during lineage commitment. This study establishes a valuable approach for identifying lineage-committed precursors and elucidating regulatory mechanisms in mammalian organogenesis, offering broad utility in developmental biology.
{"title":"Computational identification of lineage-committed precursors in mammalian organogenesis reveals a novel hematopoietic enhancer regulating Bhlhe41 expression.","authors":"Lihui Jin, Zhenyuan Han, Rebecca Hannah, Hongyu Shao, Junxin Huang, Shiying Wang, Weibin Zhang, Jiang Lin, Kun Sun, Yu Yu","doi":"10.1093/bib/bbag114","DOIUrl":"10.1093/bib/bbag114","url":null,"abstract":"<p><p>Lineage-committed precursors are essential yet rarely identified in mammalian organogenesis, as they lack definitive molecular signatures required for conventional marker-based approaches. Herein, we developed iCommitted, an integrated multi-omics computational pipeline for precise identification of these precursors. iCommitted first reconstructs in vivo organogenesis by modeling the in vitro differentiation trajectory spanning naïve to terminally differentiated cells. It then integrates epigenomic (ATAC-seq/DNase-seq) and transcriptomic (RNA-seq) data to achieve standardized developmental staging and precursor identification. Applied to mammalian hematopoiesis, iCommitted robustly identified hematopoietic progenitors as the hematopoietic lineage-committed precursors. Subsequent cis-regulatory annotation generated a high-confidence atlas of 16 774 hematopoietic cis-regulatory elements. Functional analysis of the atlas further pinpointed a 218-bp hematopoietic enhancer (chr6:145 855 899-145 856 116) that regulates Bhlhe41 expression during lineage commitment. 
This study establishes a valuable approach for identifying lineage-committed precursors and elucidating regulatory mechanisms in mammalian organogenesis, offering broad utility in developmental biology.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 2","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12991046/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147466979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}