Yi Jia, Chan Zhang, Han Zhang, Kang Dong, Yuruo Hu, Yinan Wang, Zicheng Zhao
Cancer classification is pivotal for precision oncology, yet traditional methods struggle with the molecular heterogeneity of tumors. Our study introduces a self-attention based Conv1D machine learning network designed for panel capture sequencing data, which is more commonly used in clinical settings. Combining clinical capture sequencing data and The Cancer Genome Atlas data, we achieved an overall classification accuracy of over 90%, with precision rates reaching 100% for cervical and gastric cancers. Additionally, recall rates were highest at 95.79% for gastric cancer and lowest at 77.46% for cervical cancer, demonstrating robust performance across various cancer types. The model identified key genes such as C3orf36, JHY, and TASP1, showing significant differences in mutation counts across cancers. High-impact gene enrichment analysis highlighted critical pathways like acute myeloid leukemia and adipocytokine signaling. This approach not only significantly improves the precision of cancer classification, demonstrating the potential for clinical application, but also enhances our understanding of cancer biology.
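The abstract reports per-class precision and recall separately, and the two can diverge sharply (100% precision alongside 77.46% recall for cervical cancer). A minimal sketch of how both are read off a confusion matrix follows; the three-class matrix is made-up toy data, not the study's results, and the function name is an illustrative assumption.

```python
def per_class_precision_recall(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        metrics.append((precision, recall))
    return metrics


# Toy confusion matrix (hypothetical counts, for illustration only).
toy = [
    [8, 2, 0],   # true class 0: 8 correct, 2 predicted as class 1
    [0, 9, 1],   # true class 1: 9 correct, 1 predicted as class 2
    [0, 0, 10],  # true class 2: all correct
]
metrics = per_class_precision_recall(toy)
for label, (precision, recall) in enumerate(metrics):
    print(f"class {label}: precision={precision:.3f} recall={recall:.3f}")
```

Class 0 in the toy matrix shows the same pattern as the cervical cancer result: nothing is wrongly assigned to it, so precision is perfect, but some of its true samples are assigned elsewhere, so recall is lower.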
Enhancing cancer classification accuracy with a self-attention network using panel capture sequencing data. Briefings in Bioinformatics 27(2), published 2026-03-01. DOI: 10.1093/bib/bbag120 (https://doi.org/10.1093/bib/bbag120).
Comprehensive pan-domain metagenomic classification is increasingly constrained by the memory and runtime costs of building and querying the rapidly expanding reference genome space. We introduce Kun-peng, a taxonomic classifier powered by an intelligent block-partitioned database structure and optimized search strategies, enabling ultra-scalable, memory-efficient pan-domain profiling. Using the Critical Assessment of Metagenome Interpretation II benchmark, Kun-peng substantially reduces the memory usage of database-building and querying by up to 24-fold, and accelerates sample classification by up to 4.73-fold compared with Kraken2. Kun-peng achieves competitive accuracy with fewer false positives than Kraken2, Centrifuger, and even KrakenUniq, while maintaining consistently high sensitivity across diverse datasets. In a real-world evaluation of 586 metagenomic samples spanning air, water, soil, and human-associated environments, we performed classification using a 4.3 TB pan-domain database comprising 204,477 genomes, which was built by Kun-peng with only 4.1 GB peak memory. Kun-peng processed each sample in 0.2-11.2 min with 4.0-35.4 GB peak memory, corresponding to a 54-473-fold reduction in memory usage relative to Kraken2. Compared with Sylph, Kun-peng achieved up to a 46-fold speedup while requiring 21-fold less memory. Kun-peng classified 69.8%-94.3% of reads, improving coverage by 20%-60% over the standard Kraken2 database with 62,026 genomes. This improvement reflects expanded reference coverage, although a small fraction of false positives is inherent to k-mer-based methods. Overall, Kun-peng effectively eliminates the long-standing memory bottleneck in pan-domain database building and classification, enabling rapid and scalable pan-domain taxonomic analysis of complex environmental, ecological, and exposomic sequencing datasets.
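The memory saving described above rests on partitioning the reference database into blocks, so that only part of it has to be resident at any time. The sketch below illustrates that general idea with a hash-partitioned k-mer table; it is not Kun-peng's actual data structure or search strategy. The block count, k-mer length, function names, and toy genomes are all assumptions, and shared k-mers simply keep the first taxon seen rather than a lowest common ancestor.

```python
import zlib
from collections import Counter

K = 4            # toy k-mer length (real tools use much longer k-mers)
NUM_BLOCKS = 3   # hypothetical number of database partitions


def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def block_of(kmer, num_blocks=NUM_BLOCKS):
    # Deterministic hash decides which block owns a k-mer.
    return zlib.crc32(kmer.encode()) % num_blocks


def build_blocks(references, k=K, num_blocks=NUM_BLOCKS):
    blocks = [dict() for _ in range(num_blocks)]
    for taxon, genome in references.items():
        for km in kmers(genome, k):
            # Simplification: the first taxon to claim a k-mer keeps it.
            blocks[block_of(km, num_blocks)].setdefault(km, taxon)
    return blocks


def classify(read, blocks, k=K):
    # Group the read's k-mers by block so each block is consulted once.
    by_block = {}
    for km in kmers(read, k):
        by_block.setdefault(block_of(km, len(blocks)), []).append(km)
    votes = Counter()
    for b, kms in by_block.items():
        table = blocks[b]  # on disk in a real tool; loaded one block at a time
        for km in kms:
            if km in table:
                votes[table[km]] += 1
    return votes.most_common(1)[0][0] if votes else None


references = {"taxA": "ACGTACGTGGA", "taxB": "TTTTCCCCGGGA"}  # toy genomes
blocks = build_blocks(references)
print(classify("ACGTACG", blocks), classify("CCCCGG", blocks), classify("AAAAAA", blocks))
```

Because each block is an independent table, peak memory in this sketch is bounded by the largest block rather than by the whole database, at the cost of one pass per block.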
Kun-peng enables scalable and accurate pan-domain metagenomic classification. Qiong Chen, Boliang Zhang, Chen Peng, Jiajun Huang, Zhen Liu, Xiaotao Shen, Chao Jiang. Briefings in Bioinformatics 27(2), published 2026-03-01. DOI: 10.1093/bib/bbag119. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12991049/pdf/
Spatially variable genes (SVGs) are essential for elucidating tissue organization within spatially resolved transcriptomics. While a number of computational methods have been developed for SVG identification, their reliance on algorithm-specific assumptions, such as predefined kernel functions or spatial neighborhood graphs, often results in substantial variability in sensitivity and inflated false discovery rates (FDRs) across heterogeneous datasets. To address this challenge, we here develop Castl, an ensemble-based framework for SVG identification that integrates multiple detection methods through statistically designed aggregation modules. Comprehensive evaluations on both simulated and real-world data demonstrate that Castl consistently identifies biologically meaningful spatial expression patterns, mitigates method-specific biases and effectively controls FDRs across various biological contexts, resolutions, and spatial technologies. This flexible, assumption-free framework offers a robust and standardized foundation for spatially informed feature discovery in complex biological systems.
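The abstract does not spell out Castl's aggregation modules, so the following is only a generic sketch of how an ensemble might merge per-method evidence while controlling the FDR: per-gene P-values from several detectors are merged with the Cauchy combination rule and then adjusted with the Benjamini-Hochberg procedure. The gene names, P-values, and the choice of these two procedures are assumptions for illustration.

```python
import math


def cauchy_combine(pvals, eps=1e-15):
    # Cauchy combination: average of tan((0.5 - p) * pi), mapped back to a P-value.
    clipped = [min(max(p, eps), 1.0 - eps) for p in pvals]
    t = sum(math.tan((0.5 - p) * math.pi) for p in clipped) / len(clipped)
    return 0.5 - math.atan(t) / math.pi


def benjamini_hochberg(pvals):
    # Step-up adjustment: enforce monotonicity from the largest P-value down.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P-values from three SVG detection methods for three genes.
per_method = {
    "geneA": [0.001, 0.004, 0.02],
    "geneB": [0.4, 0.6, 0.3],
    "geneC": [0.03, 0.2, 0.01],
}
genes = list(per_method)
combined = [cauchy_combine(per_method[g]) for g in genes]
adjusted = benjamini_hochberg(combined)
called = [g for g, q in zip(genes, adjusted) if q < 0.05]
print(called)
```

The Cauchy rule is used here because it tolerates correlated inputs, which is plausible when several methods analyze the same data; other aggregation rules would slot into the same place.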
Castl: robust identification of spatially variable genes in spatial transcriptomics via an ensemble-based framework. Yiyi Yu, Jiyuan Yang, Ping-An He, Xiaoqi Zheng. Briefings in Bioinformatics 27(2), published 2026-03-01. DOI: 10.1093/bib/bbag074. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12963980/pdf/
The development of single-cell RNA sequencing (scRNA-seq) technology provides unprecedented opportunities for elucidating cell heterogeneity and gene expression. Identifying and discovering cell types through cell clustering is a crucial step in analyzing scRNA-seq data. However, the high dimensionality and frequent dropout events of the data pose great challenges for cell clustering. Here, we propose a novel contrastive clustering framework called scSCCNIA (Similarity-matrix-based Contrastive Clustering with Neighbor Information Aggregation) for the accurate identification of cell clusters from scRNA-seq data. scSCCNIA adopts a Laplacian filter to aggregate neighbor information, constructs different graph views for data augmentation by using Siamese encoders with unshared parameters, and learns latent low-dimensional embeddings via similarity-matrix-based contrastive learning. Comparative analyses of multiple scRNA-seq datasets from different platforms and with varying cell numbers demonstrate that scSCCNIA outperforms existing methods in terms of cell clustering and marker gene identification. Furthermore, scSCCNIA reveals the heterogeneity and functional specificity of various cell types through Gene Ontology and Kyoto Encyclopedia of Genes and Genomes enrichment analyses. Overall, scSCCNIA is an effective algorithm for learning latent features from scRNA-seq data, enhancing cell type identification accuracy and facilitating downstream analyses of scRNA-seq data.
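Neighbor information aggregation with a Laplacian filter can be sketched as repeated low-pass smoothing of the expression matrix over a cell-cell graph. The abstract does not give the exact filter, so the form below (symmetric normalization with self-loops, filter I - L applied a fixed number of times) is an assumption, as are the function name and the toy graph.

```python
import math


def laplacian_smooth(adj, features, steps=1):
    """Apply the filter (I - L) `steps` times, where L is the symmetric
    normalized Laplacian of the graph with self-loops added."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    # I - L equals D^{-1/2} (A + I) D^{-1/2} for this normalization.
    filt = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    x = [row[:] for row in features]
    for _ in range(steps):
        x = [
            [sum(filt[i][k] * x[k][f] for k in range(n)) for f in range(len(x[0]))]
            for i in range(n)
        ]
    return x


# Toy graph: two connected cells with one expression feature each.
adj = [[0.0, 1.0], [1.0, 0.0]]
expr = [[1.0], [3.0]]
smoothed = laplacian_smooth(adj, expr, steps=1)
print(smoothed)
```

Each pass pulls a cell's profile toward those of its graph neighbors, which is one way to dampen dropout noise before an encoder sees the data.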
scSCCNIA: similarity matrix based contrastive clustering with neighbor information aggregation for single-cell RNA sequencing data. Jing Wang, Junfeng Xia, Yansen Su, Chun-Hou Zheng. Briefings in Bioinformatics 27(2), published 2026-03-01. DOI: 10.1093/bib/bbag094. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12962064/pdf/
G protein-coupled receptors (GPCRs) are among the most important drug targets, and peptide therapeutics are rapidly emerging. However, accurate prediction of peptide-GPCR interactions (PepGI) remains challenging due to the scarcity of high-quality data and the poor generalization of existing drug-target interaction (DTI) models, which are largely trained on small molecule data. Here, we introduce a progressive fine-tuning framework with a dynamic parameter selection strategy that adaptively selects critical fine-tuning parameters using Fisher information. Our method begins with pretraining on a large small molecule-GPCR dataset, followed by intermediate fine-tuning on peptide-target data to alleviate the representation mismatch across heterogeneous ligand modalities. Finally, the task-specific fine-tuning is performed on the low-resource PepGI scenario. Extensive experiments show that our approach significantly outperforms baselines across multiple evaluation metrics, and exhibits robust generalization under few-shot and practical cold-start settings. Overall, this work offers an effective solution for low-resource peptide-GPCR prediction and presents a transferable framework for cross-structure DTI modeling.
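Selecting fine-tuning parameters by Fisher information can be illustrated on a tiny logistic model: the diagonal of the empirical Fisher matrix is the mean squared gradient of the log-likelihood, and the parameters with the largest values are the ones kept trainable. This is a generic sketch, not the paper's procedure; how the selection is made dynamic during training is not described in the abstract, and the model, data, and function names are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def diagonal_fisher(theta, xs, ys):
    # Empirical Fisher diagonal: mean of squared per-sample log-likelihood gradients.
    fisher = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for j, xj in enumerate(x):
            grad = (y - p) * xj  # d/d theta_j of the Bernoulli log-likelihood
            fisher[j] += grad * grad
    return [f / len(xs) for f in fisher]


def select_top_k(fisher, k):
    # Indices of the k parameters with the largest Fisher values.
    ranked = sorted(range(len(fisher)), key=lambda j: -fisher[j])
    return sorted(ranked[:k])


theta = [0.0, 0.0, 0.0]                                      # toy model weights
xs = [[1.0, 0.0, 2.0], [1.0, 0.0, -2.0], [1.0, 0.0, 2.0]]    # toy inputs
ys = [1, 0, 1]                                               # toy labels
fisher = diagonal_fisher(theta, xs, ys)
trainable = select_top_k(fisher, k=2)
print(fisher, trainable)
```

In this toy case the second weight never influences the likelihood, so its Fisher value is zero and it would be frozen; re-running the selection at intervals during fine-tuning would be one way to make it adaptive.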
A progressive fine-tuning framework with dynamic parameter selection for low-resource peptide-GPCR interaction prediction. Mingqing Liu, Jinhui Xu, Ji Liu. Briefings in Bioinformatics 27(2), published 2026-03-01. DOI: 10.1093/bib/bbag116. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12991051/pdf/
Publisher's Note: Addendum to Volume 26, Issue Supplement 1, December 2025, International Conference on Genome Informatics ISCB-Asia 2025 Abstract Book. Briefings in Bioinformatics 27(2), published 2026-03-01. DOI: 10.1093/bib/bbag026. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12972659/pdf/
Genome-wide association studies (GWASs) have been conducted primarily in European (EUR) populations, limiting insights into underrepresented groups such as East Asian (EAS) populations; however, cross-ancestry GWASs have demonstrated high trans-ethnic genetic similarity between EUR and non-EUR populations. To enhance association analysis power in EAS populations, we propose tranScore, a novel summary-statistics-based transfer learning method that leverages trans-ethnic genetic similarity through hierarchical modeling. By treating EUR as the auxiliary population, tranScore performs joint testing of genetic effects in the auxiliary and target populations via well-established P-value combination procedures. Simulations demonstrate that tranScore maintains control of type I error rates and provides substantial power gains for diverse genetic architectures, showing robustness against various challenges including incomplete SNP overlap and effect heterogeneity. In the real-data application to eight diseases from the China Kadoorie Biobank (CKB), after incorporating the genetic information of the EUR population, tranScore identified significantly more genes than the traditional score test, which ignored such information. Approximately 41.9% of discovered genes were replicated in the Biobank Japan cohort. Overall, tranScore represents a flexible and powerful statistical approach for association analysis of complex diseases and traits through transfer learning of shared genetic similarities between the auxiliary and target populations.
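The abstract does not name which P-value combination procedure tranScore uses. As one well-established example, Fisher's method combines k independent P-values through the statistic X = -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null. The sketch below uses the closed-form survival function for even degrees of freedom; the function name and input values are assumptions, and the independence assumption may not hold for the paper's actual tests.

```python
import math


def fisher_combine(pvals):
    """Fisher's method for k independent P-values (each must be > 0)."""
    k = len(pvals)
    half_stat = -sum(math.log(p) for p in pvals)  # X / 2, where X = -2 * sum(ln p)
    # Chi-square survival function with 2k d.o.f.:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical gene-level P-values: one from the target-population test,
# one from the auxiliary-population test.
p_target, p_auxiliary = 0.05, 0.05
p_joint = fisher_combine([p_target, p_auxiliary])
print(round(p_joint, 5))
```

Two moderately small P-values combine into a smaller joint P-value, which is the intuition behind borrowing strength from the auxiliary population.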
An integrative association analysis for complex diseases in underrepresented groups by leveraging the trans-ethnic genetic similarity. Shuo Zhang, Jike Qi, Yuchen Jiang, Hua Lin, Xinyi Wang, Ting Wang, Hongyan Cao, Ping Zeng. Briefings in Bioinformatics 27(2), published 2026-03-01. DOI: 10.1093/bib/bbag103. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12971055/pdf/
Drug repurposing provides a cost-effective and time-efficient strategy to accelerate therapeutic discovery, yet most computational approaches fail to capture the multi-scale biomedical mechanisms underlying drug-disease associations, limiting interpretability. We introduce BioMNEDR (mechanism-guided network embedding for drug repurposing) that integrates heterogeneous biomedical networks through biologically curated meta-paths. BioMNEDR generates low-dimensional embeddings preserving protein-protein interactions and functional hierarchies. It further integrates multi-path predictions through an XGBoost classifier. The framework achieves state-of-the-art performance, consistently surpassing strong baselines across AUROC, AUPR, recall, and F1-score, while maintaining a balanced trade-off in precision. Case studies further highlight its practical utility, demonstrating the ability to rediscover approved drugs and prioritize promising candidates, such as cromoglicic acid for Alzheimer's disease. By explicitly modeling multi-scale mechanisms, BioMNEDR enhances both predictive accuracy and biomedical interpretability, offering a robust computational framework for systematic drug repurposing.
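A meta-path is a typed sequence of node types through a heterogeneous network, and the number of path instances between two end nodes can be obtained by multiplying the adjacency matrices along the path. The sketch below counts instances of a drug-protein-disease path; this particular meta-path and the toy matrices are assumptions, since the abstract does not list the curated meta-paths that BioMNEDR uses.

```python
def matmul(a, b):
    # Plain matrix product for small nested-list matrices.
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Toy bipartite adjacency matrices (hypothetical edges).
drug_protein = [
    [1, 1, 0],  # drug 0 targets proteins 0 and 1
    [0, 0, 1],  # drug 1 targets protein 2
]
protein_disease = [
    [1, 0],  # protein 0 is associated with disease 0
    [1, 0],  # protein 1 is associated with disease 0
    [0, 1],  # protein 2 is associated with disease 1
]

# Entry [d][s] = number of drug -> protein -> disease paths from drug d to disease s.
path_counts = matmul(drug_protein, protein_disease)
print(path_counts)
```

Counts like these, one matrix per meta-path, are the kind of per-path signal that a downstream classifier could combine into a single drug-disease score.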
BioMNEDR: mechanism-guided network embedding for drug repurposing. Yizhou Zeng, Lei Wang, Xueming Liu. Briefings in Bioinformatics 27(2), published 2026-03-01. DOI: 10.1093/bib/bbag101. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12971018/pdf/
Jingzhan Lu, Johan H Thygesen, Robin N Beaumont, Michael N Weedon, Harry D Green
As genome-wide association studies (GWAS) move from array-based genotyping to whole-exome and whole-genome sequencing, costs increase significantly. In large studies where more potential controls are available than needed, an appropriate strategy for selecting which controls to include may minimize resource use whilst maintaining statistical power. We evaluated three control selection strategies in prostate cancer GWAS using 15 250 UK Biobank cases: (a) all controls, (b) matched controls, and (c) random selection. Both (b) and (c) achieved comparable power in detecting significant loci relative to (a), but matched controls (b) showed greater consistency in identifying leading single nucleotide polymorphisms (SNPs). However, using (b) matched controls reduced discovery power by ~30% compared with (a) all controls, highlighting a trade-off. Matching controls (1:4 ratio) offers a cost-effective approach for targeted SNP analysis across phenotypes but may miss novel associations.
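Matched control selection at a 1:4 ratio can be sketched as stratified sampling without replacement: controls are grouped by the matching covariates, and each case draws up to four unused controls from its own stratum. The abstract does not state which covariates were matched, so the five-year age band and assessment centre used here are assumptions, as are the record layout and function name.

```python
import random
from collections import defaultdict


def stratum(person):
    # Hypothetical matching key: five-year age band and assessment centre.
    return (person["age"] // 5, person["centre"])


def match_controls(cases, controls, ratio=4, seed=0):
    rng = random.Random(seed)
    pool = defaultdict(list)
    for control in controls:
        pool[stratum(control)].append(control["id"])
    for ids in pool.values():
        rng.shuffle(ids)  # random order within each stratum
    matched = {}
    for case in cases:
        ids = pool[stratum(case)]
        # Pop so that no control is assigned to more than one case.
        matched[case["id"]] = [ids.pop() for _ in range(min(ratio, len(ids)))]
    return matched


cases = [{"id": "case1", "age": 62, "centre": "A"}]
controls = (
    [{"id": f"ctrl{i}", "age": 60 + i % 5, "centre": "A"} for i in range(6)]
    + [{"id": "ctrl_other1", "age": 45, "centre": "A"},
       {"id": "ctrl_other2", "age": 62, "centre": "B"}]
)
matched = match_controls(cases, controls)
print(matched)
```

Random selection, strategy (c) in the abstract, corresponds to dropping the stratum key and sampling from the whole control pool.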
Impact of control selection strategies on GWAS results: a study of prostate cancer in the UK Biobank. Briefings in Bioinformatics 27(2), published 2026-03-01. DOI: 10.1093/bib/bbag102. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12971001/pdf/
Daohong Gong, Xiaowei Xie, Jianxin Tang, Shiliang Li, Honglin Li
RNA-based technologies have demonstrated significant potential for diverse applications, ranging from vaccination to gene editing. However, their widespread adoption is limited by the critical challenge of efficient delivery. Lipid nanoparticles (LNPs) have emerged as a widely utilized RNA delivery system, yet their formulation design and optimization primarily rely on empirical trial-and-error, which is labor-intensive, time-consuming, and cost-prohibitive, thus hindering the rapid development of RNA therapeutics. To facilitate the early-stage design and optimization of LNPs for enhanced delivery efficiency, we construct LNPs-TE, a benchmark dataset comprising over 10 000 experimentally measured transfection efficiency (TE) values, and introduce LNPs integrated feature fusion Transformer (LIFT), a deep learning framework for LNP TE prediction. Comprehensive experiments demonstrate that LIFT effectively integrates multidimensional molecular representations of ionizable lipids, the key component in LNP formulation, achieving superior predictive performance, with an average Pearson correlation coefficient of 0.845 for regression and an area under the receiver operating characteristic curve (AUC-ROC) of 0.818 for multi-class classification across multiple datasets. Through scaffold-based splitting and activity cliff tasks, we further validated the generalization ability and robustness of LIFT, which achieved over a 10% improvement in the coefficient of determination (R²) compared with state-of-the-art baseline models, highlighting its potential as a practical and stable approach for the virtual screening of efficient LNP formulations. The relevant data, model and code are made publicly available at https://github.com/U12458/LIFT.
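Scaffold-based splitting tests generalization by keeping every molecule that shares a core scaffold on the same side of the train/test boundary. The sketch below assumes the scaffold string for each lipid has already been computed (in practice by a cheminformatics toolkit) and uses a common largest-group-first assignment; it is not necessarily the split used for LIFT, and the function name and toy scaffold labels are assumptions.

```python
from collections import defaultdict


def scaffold_split(scaffolds, train_frac=0.8):
    """Return (train_indices, test_indices) with no scaffold shared between them."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest scaffold groups go to training first; rarer scaffolds end up in test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    cutoff = train_frac * len(scaffolds)
    train, test = [], []
    for members in ordered:
        if len(train) + len(members) <= cutoff:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Toy scaffold labels for ten molecules (hypothetical).
scaffolds = ["s1"] * 5 + ["s2"] * 3 + ["s3"] * 2
train_idx, test_idx = scaffold_split(scaffolds)
print(train_idx, test_idx)
```

Because the test set contains only scaffolds never seen in training, scores under this split are usually lower than under a random split, which is why it is a stricter check of generalization.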
Transformer-based multidimensional feature fusion for accurate prediction of lipid nanoparticles transfection efficiency. Briefings in Bioinformatics 27(2), published 2026-03-01. DOI: 10.1093/bib/bbag092. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12951077/pdf/