Wenli Zhai, Lingyun Sun, Wenwei Fang, Yidan Dong, Chunxiao Cheng, Yuanjiao Liu, Yuan Zhou, Jiadong Ji, Lang Wu, An Pan, Eric R Gamazon, Xiong-Fei Pan, Dan Zhou
Genetics-informed proteome-wide association studies (PWASs) provide an effective way to uncover proteomic mechanisms underlying complex diseases. PWAS relies on an ancestry-matched reference panel to model the impact of genetically determined protein expression on phenotype. However, reference panels from underrepresented populations remain relatively limited. We developed a multi-ancestry framework to enhance protein prediction in these populations by integrating diverse information-sharing strategies into a Multi-Ancestry Best-performing Model (MABM). MABM improved prediction performance in both cross-validation and an external dataset. Leveraging Biobank Japan, we identified three times as many significant PWAS associations using MABM as using the Lasso model. Notably, 47.5% of the MABM-specific associations were reproduced in independent East Asian datasets with concordant effect sizes. Furthermore, MABM enhanced decision-making in gene/protein prioritization for functional validation of complex traits by validating well-established associations and uncovering novel trait-related candidates. The benefits of MABM were further validated in additional ancestries and demonstrated in brain tissue-based PWAS, underscoring its broad applicability. Our findings close critical gaps in multi-omics research among underrepresented populations and facilitate trait-relevant protein discovery in these groups.
{"title":"Cross-ancestry information transfer framework improves protein abundance prediction and protein-trait association identification.","authors":"Wenli Zhai, Lingyun Sun, Wenwei Fang, Yidan Dong, Chunxiao Cheng, Yuanjiao Liu, Yuan Zhou, Jiadong Ji, Lang Wu, An Pan, Eric R Gamazon, Xiong-Fei Pan, Dan Zhou","doi":"10.1093/bib/bbaf707","DOIUrl":"10.1093/bib/bbaf707","url":null,"abstract":"<p><p>Genetics-informed proteome-wide association studies (PWASs) provide an effective way to uncover proteomic mechanisms underlying complex diseases. PWAS relies on an ancestry-matched reference panel to model the impact of genetically determined protein expression on phenotype. However, reference panels from underrepresented populations remain relatively limited. We developed a multi-ancestry framework to enhance protein prediction in these populations by integrating diverse information-sharing strategies into a Multi-Ancestry Best-performing Model (MABM). Results indicated that MABM increased the prediction performance with higher performance observed in both cross-validation and an external dataset. Leveraging the Biobank Japan, we identified three times as many significant PWAS associations using MABM as using Lasso model. Notably, 47.5% of the MABM specific associations were reproduced in independent East Asian datasets with concordant effect sizes. Furthermore, MABM enhanced decision-making in gene/protein prioritization for functional validation for complex traits by validating well-established associations and uncovering novel trait-related candidates. The benefits of MABM were further validated in additional ancestries and demonstrated in brain tissue-based PWAS, underscoring its broad applicability. Our findings close critical gaps in multi-omics research among underrepresented populations and facilitate trait-relevant protein discovery in underrepresented populations.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12777707/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145917075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ruhai Chen, Jiekai Chen, Lingling Shi, Jiangping He
Chromatin topological structure is critical for gene regulation. Hi-C-based experiments have significantly advanced our understanding of chromatin organization. Numerous computational tools have been developed to identify various structural levels of chromatin, ranging from compartments to loops. However, there remains a lack of specialized tools for identifying non-homologous inter-chromatin contacts (NHCCs), which play important roles in chromosome territories. In this study, we present iceDP, a tool that leverages the Density Peaks clustering algorithm to identify local high-density regions within inter-chromatin contact maps. These regions undergo two subsequent filtering steps to eliminate obvious false positives. When applied to three Hi-C datasets, iceDP accurately identified known NHCCs, including olfactory receptor genes in mature olfactory sensory neurons and Polycomb repressive complex-regulated developmental genes in mouse embryonic stem cells (mESCs). Notably, iceDP also uncovered previously unreported transcriptionally active NHCCs. Compared to diffHiC and FitHiC, iceDP exhibited superior performance with the highest positive rate. Moreover, iceDP is compatible with a wide range of chromatin conformation capture techniques, including in-situ Hi-C, Micro-C, HiChIP, and BL-HiC, demonstrating its versatility and utility.
{"title":"iceDP: identifying inter-chromatin engagement via density peaks clustering algorithm.","authors":"Ruhai Chen, Jiekai Chen, Lingling Shi, Jiangping He","doi":"10.1093/bib/bbaf704","DOIUrl":"10.1093/bib/bbaf704","url":null,"abstract":"<p><p>Chromatin topological structure is critical for gene regulation. Hi-C based experiments have significantly advanced our understanding chromatin organization. Numerous computational tools have been developed to identify various structural levels of chromatin, ranging from compartments to loops. However, there remains a lack of specialized tools for identifying non-homologous inter-chromatin contacts (NHCCs), which play important roles in chromosome territories. In this study, we present iceDP, a tool that leverages the Density Peaks clustering algorithm to identify local high-density regions within inter-chromatin. These regions undergo two subsequent filtering steps to eliminate obvious false positives. When applied to three Hi-C datasets, iceDP accurately identified known NHCCs, including olfactory receptor genes in mature olfactory sensory neurons and Polycomb repressive complex-regulated developmental genes in mouse embryonic stem cells (mESCs). Notably, iceDP also uncovered previously unreported transcriptionally active NHCCs. Compared to diffHiC and FitHiC, iceDP exhibited superior performance with the highest positive rate. Moreover, iceDP is compatible with a wide range of chromatin conformation capture techniques, including in-situ Hi-C, Micro-C, HiChIP, and BL-HiC, demonstrating its versatility and utility.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12777978/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145917093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Circular RNA (circRNA) represents a critical class of regulatory RNAs with distinctive structural and functional features. The functions of circRNAs are modulated by various RNA modifications. Here, we present CircRM, a nanopore direct RNA sequencing-based computational method for profiling RNA modifications in circRNAs at single-base and single-molecule resolution. By integrating circRNA detection, read-level modification detection, and quantitative assessment of methylation rates, CircRM identified 427 high-confidence circRNAs and enabled systematic characterization of three major modifications: m5C (AUC = 0.855), m6A (AUC = 0.817), and m1A (AUC = 0.769). It revealed distinct modification patterns compared with linear RNAs, highlighting RNA-type-specific regulation. We also identified key features of circRNA-specific modifications, such as enrichment near the back-splice junctions. Cross-cell line analyses further demonstrated conserved and cell-type-specific modification patterns. Together, these findings reveal, at the computational level, a unique epitranscriptomic landscape associated with circRNAs and establish CircRM as a powerful tool for advancing the study of RNA modifications in circular RNA biology. CircRM is freely accessible at: https://github.com/jiayiAnnie17/CircRM.
{"title":"CircRM: profiling circular RNA modifications from nanopore direct RNA sequencing.","authors":"Jiayi Li, Shenglun Chen, Zhixing Wu, Haozhe Wang, Rong Xia, Jia Meng, Yuxin Zhang","doi":"10.1093/bib/bbaf726","DOIUrl":"10.1093/bib/bbaf726","url":null,"abstract":"<p><p>Circular RNA (circRNA) represents a critical class of regulatory RNAs with distinctive structural and functional features. The functions of circRNAs are modulated by various RNA modifications. Here, we present CircRM, a nanopore direct RNA sequencing-based computational method for profiling RNA modifications in circRNAs at single-base and single-molecule resolution. By integrating circRNA detection, read-level modification detection, and quantitative assessment of methylation rates, CircRM identified 427 high-confidence circRNAs and enables systematic characterization of three major modifications, m5C (AUC = 0.855), m6A (AUC = 0.817) and m1A (AUC = 0.769). It revealed distinct modification patterns compared with linear RNAs, highlighting RNA-type-specific regulations. We also identified the key features of circRNA-specific modifications, such as the enrichment near the back-splice junctions. Cross-cell line analyses further demonstrated conserved and cell-type-specific modification patterns. Together, these findings reveal, at the computational level, a unique epitranscriptomic landscape associated with circRNAs and establish CircRM as a powerful tool for advancing the study of RNA modifications in circular RNA biology. CircRM is free accessible at: https://github.com/jiayiAnnie17/CircRM.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12798809/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145965377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nandini Chatterjee, Aleksandr Taraskin, Hridya Divakaran, Natalia Jaeger, Victor Enriquez, Catherine C Hedrick, Ahmad Alimadadi
The rapid evolution of single-cell technologies has generated vast, multimodal datasets encompassing genomic, transcriptomic, proteomic, and spatial information. However, high dimensionality, noise, and computational costs pose significant challenges, and traditional feature selection methods, such as highly variable gene selection, can introduce bias. Unsupervised machine learning (ML) provides a solution by identifying informative features without predefined labels, thereby minimizing bias and capturing complex patterns. This paper reviews a diverse array of unsupervised ML techniques tailored for single-cell data. These approaches can enhance downstream analyses, such as clustering, dimensionality reduction, visualization, and data denoising, and reveal biologically relevant gene modules. Despite their advantages, challenges such as data sparsity, parameter tuning, and scalability persist. Future directions include integrating multiomic data, incorporating domain-specific knowledge, and developing scalable and interpretable algorithms. By addressing these challenges, unsupervised ML-based feature selection promises to revolutionize single-cell data analysis, driving unbiased insights into cellular heterogeneity and advancing biological discovery.
{"title":"Unveiling patterns: an exploration of machine learning techniques for unsupervised feature selection in single-cell data.","authors":"Nandini Chatterjee, Aleksandr Taraskin, Hridya Divakaran, Natalia Jaeger, Victor Enriquez, Catherine C Hedrick, Ahmad Alimadadi","doi":"10.1093/bib/bbag006","DOIUrl":"10.1093/bib/bbag006","url":null,"abstract":"<p><p>The rapid evolution of single-cell technologies has generated vast, multimodal datasets encompassing genomic, transcriptomic, proteomic, and spatial information. However, high dimensionality, noise, and computational costs pose significant challenges, often introducing bias through traditional feature selection methods, such as highly variable gene selection. Unsupervised machine learning (ML) provides a solution by identifying informative features without predefined labels, thereby minimizing bias and capturing complex patterns. This paper reviews a diverse array of unsupervised ML techniques tailored for single-cell data. These approaches could enhance downstream analyses, such as clustering, dimensionality reduction, visualization, and data denoising, and reveal biologically relevant gene modules. Despite their advantages, challenges such as data sparsity, parameter tuning, and scalability persist. Future directions include integrating multiomic data, incorporating domain-specific knowledge, and developing scalable and interpretable algorithms. By addressing these challenges, unsupervised ML-based feature selection promises to revolutionize single-cell data analysis, driving unbiased insights into cellular heterogeneity and advancing biological discovery.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12834302/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146050340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predicting gene expression from genomic sequences is a central goal in computational genomics. Recent advances have demonstrated that deep learning models trained on large-scale epigenomic datasets hold significant promise for this task. However, their success heavily depends on how they are applied: most models are trained exclusively on a reference genome, limiting their ability to capture individual-specific genetic variation. Consequently, while these models perform well on reference genomes, they often struggle when applied to personal genomic data. This review discusses recent efforts to overcome these limitations and explores methods aimed at improving the prediction of personalized gene expression. In particular, we compare the performance of deep learning models with traditional expression quantitative trait loci (eQTL)-based linear approaches, examine novel fine-tuning strategies, and highlight the emergence of genomic language models. Across multiple studies, we find that deep learning models still face significant challenges in outperforming linear models for cross-individual gene expression prediction. Despite ongoing advances in model architecture and training methodology, accurately and robustly predicting personalized gene expression remains an open challenge in the field.
{"title":"Personalized gene expression prediction in the era of deep learning: a review.","authors":"Viksar Dubey, Li Shen","doi":"10.1093/bib/bbag022","DOIUrl":"10.1093/bib/bbag022","url":null,"abstract":"<p><p>Predicting gene expression from genomic sequences is a central goal in computational genomics. Recent advances have demonstrated that deep learning models trained on large-scale epigenomic datasets hold significant promise for this task. However, their success heavily depends on how they are applied: most models are trained exclusively on a reference genome, limiting their ability to capture individual-specific genetic variation. Consequently, while these models perform well on reference genomes, they often struggle when applied to personal genomic data. This review discusses recent efforts to overcome these limitations and explores methods aimed at improving the prediction of personalized gene expression. In particular, we compare the performance of deep learning models with traditional expression quantitative trait loci-based linear approaches, examining novel fine-tuning strategies, and highlighting the emergence of genomic language models. Across multiple studies, we find that deep learning models still face significant challenges in outperforming linear models for cross-individual gene expression prediction. Despite ongoing advances in model architecture and training methodology, accurately and robustly predicting personalized gene expression remains an open challenge in the field.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12856953/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146084285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Weighted gene co-expression network analysis (WGCNA) is among the most widely employed methods in bioinformatics. WGCNA enables the identification of gene clusters (modules) exhibiting correlated expression patterns, the association of these modules with traits, and the exploration of candidate biomarker genes by focusing on hub genes within the modules. WGCNA has been successfully applied in diverse biological contexts. However, conventional algorithms manifest three principal limitations: the assumption of scale-free topology, the requirement for parameter tuning, and the neglect of regression line slopes. These limitations are addressed by SGCRNA. SGCRNA provides Julia functions for the analysis of co-expression networks derived from various types of biological data, such as gene expression data. The Julia packages and their source code are freely available at https://github.com/C37H41N2O6/SGCRNAs.jl.
{"title":"SGCRNA: spectral clustering-guided co-expression network analysis without scale-free constraints for multi-omic data.","authors":"Tatsunori Osone, Tomoka Takao, Shigeo Otake, Takeshi Takarada","doi":"10.1093/bib/bbag021","DOIUrl":"10.1093/bib/bbag021","url":null,"abstract":"<p><p>Weighted gene co-expression network analysis (WGCNA) is among the most widely employed methods in bioinformatics. WGCNA enables the identification of gene clusters (modules) exhibiting correlated expression patterns, the association of these modules with traits, and the exploration of candidate biomarker genes by focusing on hub genes within the modules. WGCNA has been successfully applied in diverse biological contexts. However, conventional algorithms manifest three principal limitations: the assumption of scale-free topology, the requirement for parameter tuning, and the neglect of regression line slopes. These limitations are addressed by SGCRNA. SGCRNA provides Julia functions for the analysis of co-expression networks derived from various types of biological data, such as gene expression data. The Julia packages and their source code are freely available at https://github.com/C37H41N2O6/SGCRNAs.jl.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12856952/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146084364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Margherita A G Matarrese, Michela Quadrini, Nicole Luchetti, Federico Di Petta, Daniele Durante, Monica Ballarino, Letizia Chiodo, Luca Tesei
The discovery of long non-coding RNAs (lncRNA) has revealed additional layers of gene-expression control. Specific interactions of lncRNAs with DNA, RNAs, and RNA-binding proteins enable regulation in both cytoplasmic and nuclear compartments; e.g. a conserved triple-helix motif is essential for MALAT1 stability and oncogenic activity. Here, we present a secondary-structure-based framework to annotate and detect RNA triple helices. First, we extend the dot-bracket formalism with a third annotation line that encodes Hoogsteen contacts. Second, we introduce TripleMatcher, which searches for a triple-helix pattern, filters candidates by C1'-C1' distance thresholds, and merges overlaps into region-level zones. Using telomerase RNAs and RNA-stability elements with experimentally established triple helices (8 RNAs), TripleMatcher localized all annotated regions (structure-wise detection 8/8); geometric filtering removed most spurious candidates and improved precision (positive predictive value from 0.42 to 0.81) and overall accuracy (F1 from 0.42 to 0.62) while maintaining sensitivity. Benchmarking eight predictors showed that pseudoknot-aware methods most reliably reproduce the local architecture required for detection, aligning secondary-structure quality with downstream triple-helix recovery. Applied prospectively, the framework identified candidate regions directly from predicted secondary structures and scaled to a screen of 4160 RNAs, where distance filtering reduced 150 990 (median per molecule: 108 [20-270]) raw candidates to 97 geometrically feasible regions across seven molecules, including human telomerase complexes. Together, the notation and TripleMatcher provide a concise route from secondary structure to a small, interpretable set of triple-helix candidates suitable for targeted experimental validation.
{"title":"Decoding RNA triple helices: identification from sequence and secondary structure.","authors":"Margherita A G Matarrese, Michela Quadrini, Nicole Luchetti, Federico Di Petta, Daniele Durante, Monica Ballarino, Letizia Chiodo, Luca Tesei","doi":"10.1093/bib/bbag009","DOIUrl":"10.1093/bib/bbag009","url":null,"abstract":"<p><p>The discovery of long non-coding RNAs (lncRNA) has revealed additional layers of gene-expression control. Specific interactions of lncRNAs with DNA, RNAs, and RNA-binding proteins enable regulation in both cytoplasmic and nuclear compartments; e.g. a conserved triple-helix motif is essential for MALAT1 stability and oncogenic activity. Here, we present a secondary-structure-based framework to annotate and detect RNA triple helices. First, we extend the dot-bracket formalism with a third annotation line that encodes Hoogsteen contacts. Second, we introduce TripleMatcher, which searches for a triple-helix pattern, filters candidates by C1'-C1' distance thresholds, and merges overlaps into region-level zones. Using telomerase RNAs and RNA-stability elements with experimentally established triple helices (8 RNAs), TripleMatcher localized all annotated regions (structure-wise detection 8/8); geometric filtering removed most spurious candidates and improved precision (positive predictive value from 0.42 to 0.81) and overall accuracy (F$_{1}$ from 0.42 to 0.62) while maintaining sensitivity. Benchmarking eight predictors showed that pseudoknot-aware methods most reliably reproduce the local architecture required for detection, aligning secondary-structure quality with downstream triple-helix recovery. Applied prospectively, the framework identified candidate regions directly from predicted secondary structures and scaled to a screen of 4160 RNAs, where distance filtering reduced 150 990 (median per molecule: 108 [20-270]) raw candidates to 97 geometrically feasible regions across seven molecules, including human telomerase complexes. Together, the notation and TripleMatcher provide a concise route from secondary structure to a small, interpretable set of triple-helix candidates suitable for targeted experimental validation.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12834306/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146050324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wenbo Zhang, Yihui Wang, Jin Liu, Bowen Ke, Jiancheng Lv, Xianggen Liu
Molecular property prediction is a critical task in computational chemistry and drug discovery. While deep learning has advanced this field, the increasing complexity of models contrasts with the scarcity of labeled data, leading to severe overfitting and limited generalization. In this paper, we propose TasProp, a task-specific pre-training strategy for molecular property prediction, particularly for scenarios with small labeled datasets. To learn a robust molecular representation, TasProp first projects both labeled and unlabeled data into a unified latent space. Then, we introduce a task-specific contrastive loss that aligns closely with the final prediction task and apply it to the labeled data. This contrastive loss encourages the model to learn more cohesive and distinguishable molecular representations corresponding to property categories, which, in turn, enhances the model's performance on downstream property prediction tasks. Additionally, we propose a novel data augmentation method, accompanied by a theoretical analysis, to mitigate the challenge of labeled data scarcity. With the task-specific pre-training and augmented data, TasProp outperforms state-of-the-art methods on many molecular property prediction tasks, including three publicly available datasets and two curated datasets related to anesthesiology. Furthermore, we provide an interactive web resource to facilitate model exploration and application, allowing users to easily predict the properties of input molecules online.
{"title":"Task-specific pre-training for molecular property prediction.","authors":"Wenbo Zhang, Yihui Wang, Jin Liu, Bowen Ke, Jiancheng Lv, Xianggen Liu","doi":"10.1093/bib/bbag010","DOIUrl":"10.1093/bib/bbag010","url":null,"abstract":"<p><p>Molecular property prediction is a critical task in computational chemistry and drug discovery. While deep learning has advanced this field, the increasing complexity of models contrasts with the scarcity of labeled data, leading to severe overfitting and limited generalization. In this paper, we propose TasProp, a task-specific pre-training strategy for molecular property prediction, particularly for the scenarios with small labeled datasets. To learn a robust molecular representation, TasProp first projects both labeled and unlabeled data into a unified latent space. Then, we introduce a task-specific contrastive loss that aligns closely with the final prediction task and apply it to the labeled data. This contrastive loss encourages the model to learn more cohesive and distinguishable molecular representations corresponding to property categories, which in turn, enhances the model's performance on downstream property prediction tasks. Additionally, we propose a novel data augmentation method, accompanied by a theoretical analysis, to mitigate the challenge of labeled data scarcity. With the task-specific pre-training and augmented data, TasProp outperforms the state-of-the-art methods on many molecular property prediction tasks, including three publicly available datasets and two curated datasets related to anesthesiology. Furthermore, we provide an interactive web resource to facilitate model exploration and application, allowing users to easily predict the properties of input molecules online.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12853129/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146084351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The value and nature of the representations learned during the pretraining of genomic language models (gLMs) remain actively debated. We introduce Nucleotide Generative Pretrained Transformer (GPT), a decoder-only transformer with single-nucleotide tokenization, to dissect the role of pretraining. Through experiments varying repetitive element (RE) weights during pretraining (0.0-1.0), comparative finetuning against random initialization, linear probing of internal representations, and sparse autoencoder (SAE)-based interpretability, we evaluated the impact of pretraining and how REs in genomic data influence model learning. Models with moderate RE downweighting (0.5) consistently achieved optimal performance across seven genomic classification tasks, with pretrained models providing substantial performance gains over baselines. SAE feature annotation via sequence alignment revealed substantial RE-associated patterns in the pretrained model internal representations, suggesting that REs-which comprise 30%-60% of mammalian genomes-may dominate the pretraining objective. Our findings support the utility of pretraining and underscore the need for pretraining strategies that better accommodate repetitive sequences across the genome while also fostering the learning of less common but biologically important representations. This study highlights a key challenge for gLMs: ensuring that models broadly learn functional genomic syntax beyond simply recognizing ubiquitous repeats.
{"title":"Probing genomic language models: Nucleotide Generative Pretrained Transformer and the role of pretraining in learned representations.","authors":"Shae M Mclaughlin, Daniel A Lim","doi":"10.1093/bib/bbag011","DOIUrl":"10.1093/bib/bbag011","url":null,"abstract":"<p><p>The value and nature of the representations learned during the pretraining of genomic language models (gLMs) remain actively debated. We introduce Nucleotide Generative Pretrained Transformer (GPT), a decoder-only transformer with single-nucleotide tokenization, to dissect the role of pretraining. Through experiments varying repetitive element (RE) weights during pretraining (0.0-1.0), comparative finetuning against random initialization, linear probing of internal representations, and sparse autoencoder (SAE)-based interpretability, we evaluated the impact of pretraining and how REs in genomic data influence model learning. Models with moderate RE downweighting (0.5) consistently achieved optimal performance across seven genomic classification tasks, with pretrained models providing substantial performance gains over baselines. SAE feature annotation via sequence alignment revealed substantial RE-associated patterns in the pretrained model internal representations, suggesting that REs-which comprise 30%-60% of mammalian genomes-may dominate the pretraining objective. Our findings support the utility of pretraining and underscore the need for pretraining strategies that better accommodate repetitive sequences across the genome while also fostering the learning of less common but biologically important representations. This study highlights a key challenge for gLMs: ensuring that models broadly learn functional genomic syntax beyond simply recognizing ubiquitous repeats.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12866925/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146112291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent technological advances have expanded the availability of high-throughput biological datasets, opening the way to the reliable design of digital twins of biomedical systems or patients. Such computational tools represent key chemical reaction networks driving perturbation or drug response and can profoundly guide drug discovery and personalized therapeutics. Yet, their development still depends on laborious data integration by the human modeler, so automated approaches are critically needed. The successes of data-driven system discovery in Physics, rooted in clean datasets and well-defined governing laws, have fueled interest in applying similar techniques in Biology, which presents unique challenges. Here, we reviewed 177 methodologies for automatically inferring digital twins from biological time series, which mostly involved symbolic or sparse regression, and recapitulated them in a Shiny app. We evaluated algorithms according to eight biological and methodological challenges, associated with integrating noisy/incomplete data, multiple conditions, prior knowledge, latent variables, or dealing with high dimensionality, unobserved variable derivatives, candidate library design, and uncertainty quantification. On these criteria, sparse regression generally outperformed symbolic regression, particularly when using Bayesian frameworks. Deep learning and large language models further emerge as innovative tools to integrate prior knowledge, although their reliability and consistency need to be improved. While no single method addresses all challenges, we argue that progress in learning digital twins will come from hybrid and modular frameworks combining chemical reaction network-based mechanistic grounding, Bayesian uncertainty quantification, and the generative and knowledge integration capacities of deep learning. To support their development, we further highlight key components required for future benchmark development to evaluate methods across all challenges.
{"title":"Data-driven discovery of digital twins in biomedical research.","authors":"Clémence Métayer, Annabelle Ballesta, Julien Martinelli","doi":"10.1093/bib/bbaf722","DOIUrl":"10.1093/bib/bbaf722","url":null,"abstract":"<p><p>Recent technological advances have expanded the availability of high-throughput biological datasets, opening the way to the reliable design of digital twins of biomedical systems or patients. Such computational tools represent key chemical reaction networks driving perturbation or drug response and can profoundly guide drug discovery and personalized therapeutics. Yet, their development still depends on laborious data integration by the human modeler, so that automated approaches are critically needed. The successes of data-driven system discovery in Physics, rooted in clean datasets and well-defined governing laws, have fueled interest in applying similar techniques in Biology, which presents unique challenges. Here, we reviewed 177 methodologies for automatically inferring digital twins from biological time series, which mostly involved symbolic or sparse regression, and recapitulated them in a Shiny app. We evaluated algorithms according to eight biological and methodological challenges, associated with integrating noisy/incomplete data, multiple conditions, prior knowledge, latent variables, or dealing with high dimensionality, unobserved variable derivatives, candidate library design, and uncertainty quantification. Upon these criteria, sparse regression generally outperformed symbolic regression, particularly when using Bayesian frameworks. Next, deep learning and large language models further emerge as innovative tools to integrate prior knowledge, although their reliability and consistency need to be improved. While no single method addresses all challenges, we argue that progress in learning digital twins will come from hybrid and modular frameworks combining chemical reaction network-based mechanistic grounding, Bayesian uncertainty quantification, and the generative and knowledge integration capacities of deep learning. To support their development, we further highlight key components required for future benchmark development to evaluate methods across all challenges.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12890721/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146156168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}