Circular RNA (circRNA) represents a critical class of regulatory RNAs with distinctive structural and functional features. The functions of circRNAs are modulated by various RNA modifications. Here, we present CircRM, a nanopore direct RNA sequencing-based computational method for profiling RNA modifications in circRNAs at single-base and single-molecule resolution. By integrating circRNA detection, read-level modification detection, and quantitative assessment of methylation rates, CircRM identified 427 high-confidence circRNAs and enabled systematic characterization of three major modifications: m5C (AUC = 0.855), m6A (AUC = 0.817), and m1A (AUC = 0.769). It revealed modification patterns distinct from those of linear RNAs, highlighting RNA-type-specific regulation. We also identified key features of circRNA-specific modifications, such as enrichment near back-splice junctions. Cross-cell-line analyses further demonstrated both conserved and cell-type-specific modification patterns. Together, these findings reveal, at the computational level, a unique epitranscriptomic landscape associated with circRNAs and establish CircRM as a powerful tool for advancing the study of RNA modifications in circular RNA biology. CircRM is freely available at: https://github.com/jiayiAnnie17/CircRM.
CircRM: profiling circular RNA modifications from nanopore direct RNA sequencing. Jiayi Li, Shenglun Chen, Zhixing Wu, Haozhe Wang, Rong Xia, Jia Meng, Yuxin Zhang. Briefings in Bioinformatics 27(1), 2026-01-07. DOI: 10.1093/bib/bbaf726. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12798809/pdf/
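The "quantitative assessment of methylation rates" step mentioned in the CircRM abstract can be pictured as pooling read-level calls into a per-site fraction. The sketch below is only an illustration under stated assumptions: the function name, the `(site_id, probability)` input layout, and the 0.5 probability cutoff are hypothetical and are not taken from the CircRM code base.

```python
from collections import defaultdict


def site_methylation_rates(read_calls, prob_cutoff=0.5, min_reads=1):
    """Aggregate read-level modification calls into per-site rates.

    read_calls  : iterable of (site_id, probability) pairs, one per read
                  covering the site (hypothetical input layout).
    prob_cutoff : reads with probability >= cutoff count as modified
                  (illustrative value, not CircRM's).
    min_reads   : sites covered by fewer reads are dropped.
    """
    modified = defaultdict(int)
    total = defaultdict(int)
    for site, prob in read_calls:
        total[site] += 1
        if prob >= prob_cutoff:
            modified[site] += 1
    # Rate = modified reads / all reads covering the site.
    return {
        site: modified[site] / total[site]
        for site in total
        if total[site] >= min_reads
    }


# Made-up calls: four reads over one site, one read over another.
calls = [
    ("circA:120", 0.9), ("circA:120", 0.2),
    ("circA:120", 0.7), ("circA:120", 0.8),
    ("circB:45", 0.1),
]
rates = site_methylation_rates(calls)
print(rates)
```

Raising `min_reads` trades coverage for more stable rate estimates, since sites supported by a single read can only take the values 0 or 1.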
Nandini Chatterjee, Aleksandr Taraskin, Hridya Divakaran, Natalia Jaeger, Victor Enriquez, Catherine C Hedrick, Ahmad Alimadadi
The rapid evolution of single-cell technologies has generated vast, multimodal datasets encompassing genomic, transcriptomic, proteomic, and spatial information. However, high dimensionality, noise, and computational costs pose significant challenges, often introducing bias through traditional feature selection methods, such as highly variable gene selection. Unsupervised machine learning (ML) provides a solution by identifying informative features without predefined labels, thereby minimizing bias and capturing complex patterns. This paper reviews a diverse array of unsupervised ML techniques tailored for single-cell data. These approaches could enhance downstream analyses, such as clustering, dimensionality reduction, visualization, and data denoising, and reveal biologically relevant gene modules. Despite their advantages, challenges such as data sparsity, parameter tuning, and scalability persist. Future directions include integrating multiomic data, incorporating domain-specific knowledge, and developing scalable and interpretable algorithms. By addressing these challenges, unsupervised ML-based feature selection promises to revolutionize single-cell data analysis, driving unbiased insights into cellular heterogeneity and advancing biological discovery.
Unveiling patterns: an exploration of machine learning techniques for unsupervised feature selection in single-cell data. Nandini Chatterjee, Aleksandr Taraskin, Hridya Divakaran, Natalia Jaeger, Victor Enriquez, Catherine C Hedrick, Ahmad Alimadadi. Briefings in Bioinformatics 27(1), 2026-01-07. DOI: 10.1093/bib/bbag006. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12834302/pdf/
Predicting gene expression from genomic sequences is a central goal in computational genomics. Recent advances have demonstrated that deep learning models trained on large-scale epigenomic datasets hold significant promise for this task. However, their success heavily depends on how they are applied: most models are trained exclusively on a reference genome, limiting their ability to capture individual-specific genetic variation. Consequently, while these models perform well on reference genomes, they often struggle when applied to personal genomic data. This review discusses recent efforts to overcome these limitations and explores methods aimed at improving the prediction of personalized gene expression. In particular, we compare the performance of deep learning models with traditional expression quantitative trait loci-based linear approaches, examine novel fine-tuning strategies, and highlight the emergence of genomic language models. Across multiple studies, we find that deep learning models still struggle to outperform linear models for cross-individual gene expression prediction. Despite ongoing advances in model architecture and training methodology, accurately and robustly predicting personalized gene expression remains an open challenge in the field.
Personalized gene expression prediction in the era of deep learning: a review. Viksar Dubey, Li Shen. Briefings in Bioinformatics 27(1), 2026-01-07. DOI: 10.1093/bib/bbag022. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12856953/pdf/
Weighted gene co-expression network analysis (WGCNA) is among the most widely employed methods in bioinformatics. WGCNA enables the identification of gene clusters (modules) exhibiting correlated expression patterns, the association of these modules with traits, and the exploration of candidate biomarker genes by focusing on hub genes within the modules. WGCNA has been successfully applied in diverse biological contexts. However, conventional algorithms manifest three principal limitations: the assumption of scale-free topology, the requirement for parameter tuning, and the neglect of regression line slopes. These limitations are addressed by SGCRNA. SGCRNA provides Julia functions for the analysis of co-expression networks derived from various types of biological data, such as gene expression data. The Julia packages and their source code are freely available at https://github.com/C37H41N2O6/SGCRNAs.jl.
SGCRNA: spectral clustering-guided co-expression network analysis without scale-free constraints for multi-omic data. Tatsunori Osone, Tomoka Takao, Shigeo Otake, Takeshi Takarada. Briefings in Bioinformatics 27(1), 2026-01-07. DOI: 10.1093/bib/bbag021. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12856952/pdf/
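One of the limitations named in the SGCRNA abstract, the "neglect of regression line slopes", can be seen in a toy case: two gene pairs may share the same Pearson correlation while their regression slopes differ twentyfold. The sketch below uses made-up expression values and plain Python; it illustrates the limitation only and does not reproduce SGCRNA's method, which is distributed as a Julia package.

```python
from statistics import mean


def pearson_and_slope(x, y):
    """Return (Pearson r, least-squares slope of y regressed on x)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / (sxx * syy) ** 0.5
    slope = sxy / sxx
    return r, slope


# Made-up expression values: gene_b and gene_c both track gene_a perfectly,
# but respond with very different magnitudes.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0 * v for v in gene_a]
gene_c = [0.1 * v for v in gene_a]

r_ab, slope_ab = pearson_and_slope(gene_a, gene_b)
r_ac, slope_ac = pearson_and_slope(gene_a, gene_c)
print(f"a~b: r={r_ab:.3f}, slope={slope_ab:.3f}")
print(f"a~c: r={r_ac:.3f}, slope={slope_ac:.3f}")
```

A network built on correlation alone would weight both edges identically, which is the kind of information loss the abstract points to.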
Margherita A G Matarrese, Michela Quadrini, Nicole Luchetti, Federico Di Petta, Daniele Durante, Monica Ballarino, Letizia Chiodo, Luca Tesei
The discovery of long non-coding RNAs (lncRNA) has revealed additional layers of gene-expression control. Specific interactions of lncRNAs with DNA, RNAs, and RNA-binding proteins enable regulation in both cytoplasmic and nuclear compartments; e.g. a conserved triple-helix motif is essential for MALAT1 stability and oncogenic activity. Here, we present a secondary-structure-based framework to annotate and detect RNA triple helices. First, we extend the dot-bracket formalism with a third annotation line that encodes Hoogsteen contacts. Second, we introduce TripleMatcher, which searches for a triple-helix pattern, filters candidates by C1'-C1' distance thresholds, and merges overlaps into region-level zones. Using telomerase RNAs and RNA-stability elements with experimentally established triple helices (8 RNAs), TripleMatcher localized all annotated regions (structure-wise detection 8/8); geometric filtering removed most spurious candidates and improved precision (positive predictive value from 0.42 to 0.81) and overall accuracy (F$_{1}$ from 0.42 to 0.62) while maintaining sensitivity. Benchmarking eight predictors showed that pseudoknot-aware methods most reliably reproduce the local architecture required for detection, aligning secondary-structure quality with downstream triple-helix recovery. Applied prospectively, the framework identified candidate regions directly from predicted secondary structures and scaled to a screen of 4160 RNAs, where distance filtering reduced 150 990 (median per molecule: 108 [20-270]) raw candidates to 97 geometrically feasible regions across seven molecules, including human telomerase complexes. Together, the notation and TripleMatcher provide a concise route from secondary structure to a small, interpretable set of triple-helix candidates suitable for targeted experimental validation.
Decoding RNA triple helices: identification from sequence and secondary structure. Margherita A G Matarrese, Michela Quadrini, Nicole Luchetti, Federico Di Petta, Daniele Durante, Monica Ballarino, Letizia Chiodo, Luca Tesei. Briefings in Bioinformatics 27(1), 2026-01-07. DOI: 10.1093/bib/bbag009. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12834306/pdf/
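The last two TripleMatcher stages described in the abstract, filtering candidates by a C1'-C1' distance threshold and merging overlaps into region-level zones, can be sketched as a filter followed by an interval merge. Everything specific here is an assumption for illustration: the `(start, end, distance)` tuple layout, the function name, and the 12.0 cutoff are hypothetical and not taken from the TripleMatcher implementation.

```python
def merge_candidates(candidates, max_c1_distance):
    """Filter candidates by distance, then merge overlapping intervals.

    candidates      : list of (start, end, c1_distance) tuples with
                      inclusive coordinates (hypothetical layout).
    max_c1_distance : candidates with a larger C1'-C1' distance are
                      discarded (the threshold value is an assumption).
    """
    # Keep only geometrically feasible candidates, sorted by start.
    kept = sorted((s, e) for s, e, d in candidates if d <= max_c1_distance)

    regions = []
    for start, end in kept:
        if regions and start <= regions[-1][1]:
            # Overlaps the previous region: extend it.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


# Made-up candidates; the third one fails the distance filter.
hits = [(10, 14, 9.8), (13, 18, 10.5), (40, 44, 25.0), (60, 63, 11.0)]
regions = merge_candidates(hits, max_c1_distance=12.0)
print(regions)
```

Filtering before merging keeps an infeasible candidate from bridging two otherwise separate regions.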
Wenbo Zhang, Yihui Wang, Jin Liu, Bowen Ke, Jiancheng Lv, Xianggen Liu
Molecular property prediction is a critical task in computational chemistry and drug discovery. While deep learning has advanced this field, the increasing complexity of models contrasts with the scarcity of labeled data, leading to severe overfitting and limited generalization. In this paper, we propose TasProp, a task-specific pre-training strategy for molecular property prediction, particularly for scenarios with small labeled datasets. To learn a robust molecular representation, TasProp first projects both labeled and unlabeled data into a unified latent space. Then, we introduce a task-specific contrastive loss that aligns closely with the final prediction task and apply it to the labeled data. This contrastive loss encourages the model to learn more cohesive and distinguishable molecular representations corresponding to property categories, which, in turn, enhances the model's performance on downstream property prediction tasks. Additionally, we propose a novel data augmentation method, accompanied by a theoretical analysis, to mitigate the scarcity of labeled data. With the task-specific pre-training and augmented data, TasProp outperforms state-of-the-art methods on many molecular property prediction tasks, including three publicly available datasets and two curated datasets related to anesthesiology. Furthermore, we provide an interactive web resource to facilitate model exploration and application, allowing users to easily predict the properties of input molecules online.
Task-specific pre-training for molecular property prediction. Wenbo Zhang, Yihui Wang, Jin Liu, Bowen Ke, Jiancheng Lv, Xianggen Liu. Briefings in Bioinformatics 27(1), 2026-01-07. DOI: 10.1093/bib/bbag010. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12853129/pdf/
The value and nature of the representations learned during the pretraining of genomic language models (gLMs) remain actively debated. We introduce Nucleotide Generative Pretrained Transformer (GPT), a decoder-only transformer with single-nucleotide tokenization, to dissect the role of pretraining. Through experiments varying repetitive element (RE) weights during pretraining (0.0-1.0), comparative finetuning against random initialization, linear probing of internal representations, and sparse autoencoder (SAE)-based interpretability, we evaluated the impact of pretraining and how REs in genomic data influence model learning. Models with moderate RE downweighting (0.5) consistently achieved optimal performance across seven genomic classification tasks, with pretrained models providing substantial performance gains over baselines. SAE feature annotation via sequence alignment revealed substantial RE-associated patterns in the pretrained model's internal representations, suggesting that REs, which comprise 30%-60% of mammalian genomes, may dominate the pretraining objective. Our findings support the utility of pretraining and underscore the need for pretraining strategies that better accommodate repetitive sequences across the genome while also fostering the learning of less common but biologically important representations. This study highlights a key challenge for gLMs: ensuring that models broadly learn functional genomic syntax beyond simply recognizing ubiquitous repeats.
Probing genomic language models: Nucleotide Generative Pretrained Transformer and the role of pretraining in learned representations. Shae M Mclaughlin, Daniel A Lim. Briefings in Bioinformatics 27(1), 2026-01-07. DOI: 10.1093/bib/bbag011. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12866925/pdf/
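The idea of varying repetitive-element weights during pretraining can be illustrated with a weighted next-token loss over single-nucleotide tokens. The sketch below rests on assumptions that are not confirmed by the abstract: it marks repeats by lowercase bases (the common soft-masking convention), normalizes by the sum of weights, and uses a hypothetical function name. The authors' actual weighting scheme may differ.

```python
import math


def weighted_nll(sequence, probs_of_true_base, re_weight=0.5):
    """Weighted mean negative log-likelihood over nucleotide tokens.

    sequence           : string of bases; lowercase = repeat-masked
                         (soft-masking convention, assumed here).
    probs_of_true_base : model probability assigned to each true base.
    re_weight          : loss weight for repeat tokens (1.0 = no
                         downweighting, 0.0 = repeats ignored).
    """
    total, weight_sum = 0.0, 0.0
    for base, p in zip(sequence, probs_of_true_base):
        w = re_weight if base.islower() else 1.0
        total += w * -math.log(p)
        weight_sum += w
    # Normalizing by the weight sum is a design choice of this sketch.
    return total / weight_sum if weight_sum else 0.0


# Made-up example: two unique bases predicted well, two repeat bases poorly.
seq = "ACgt"
probs = [0.5, 0.5, 0.25, 0.25]
for w in (1.0, 0.5, 0.0):
    print(w, round(weighted_nll(seq, probs, re_weight=w), 4))
```

With `re_weight=1.0` every token counts equally, while `re_weight=0.0` removes repeat tokens from the objective; intermediate values such as 0.5 sit between the two extremes.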
Recent technological advances have expanded the availability of high-throughput biological datasets, opening the way to the reliable design of digital twins of biomedical systems or patients. Such computational tools represent key chemical reaction networks driving perturbation or drug response and can profoundly guide drug discovery and personalized therapeutics. Yet, their development still depends on laborious data integration by the human modeler, so automated approaches are critically needed. The successes of data-driven system discovery in Physics, rooted in clean datasets and well-defined governing laws, have fueled interest in applying similar techniques in Biology, which presents unique challenges. Here, we reviewed 177 methodologies for automatically inferring digital twins from biological time series, which mostly involved symbolic or sparse regression, and summarized them in a Shiny app. We evaluated algorithms according to eight biological and methodological challenges, associated with integrating noisy/incomplete data, multiple conditions, prior knowledge, latent variables, or dealing with high dimensionality, unobserved variable derivatives, candidate library design, and uncertainty quantification. On these criteria, sparse regression generally outperformed symbolic regression, particularly when using Bayesian frameworks. Deep learning and large language models are also emerging as innovative tools to integrate prior knowledge, although their reliability and consistency need to be improved. While no single method addresses all challenges, we argue that progress in learning digital twins will come from hybrid and modular frameworks combining chemical reaction network-based mechanistic grounding, Bayesian uncertainty quantification, and the generative and knowledge integration capacities of deep learning. 
To support their development, we further highlight key components required for future benchmark development to evaluate methods across all challenges.
Data-driven discovery of digital twins in biomedical research. Clémence Métayer, Annabelle Ballesta, Julien Martinelli. Briefings in Bioinformatics 27(1), 2026-01-07. DOI: 10.1093/bib/bbaf722. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12890721/pdf/
Yan Gan, Yangsong He, Pu Zhao, Wai-Ki Ching, Yushan Qiu
Alternative splicing (AS) is a key driver of transcriptomic diversity and plays a pivotal role in epithelial-mesenchymal transition (EMT). During EMT, dynamic splicing changes contribute to cell plasticity and metastasis, yet the upstream regulatory logic remains unclear. Although transcription factors (TFs) are thought to influence AS programs, they typically act through RNA-binding proteins (RBPs), forming a hierarchical TF$\rightarrow$RBP$\rightarrow$AS cascade. Current computational strategies struggle to recover such multi-layered regulation from bulk cross-sectional data, limiting our ability to identify TFs that ultimately control EMT-related AS events. To address this gap, we developed CTAS, a network control theory-based approach to identify key regulatory TFs of AS events during EMT. CTAS integrates pseudotime ordering, trend analysis, sparse directed network inference, and control-theoretic screening into a unified framework. In simulations, CTAS reconstructs EMT trajectories with Spearman's $\rho = 0.99946$ and directed networks with ROC AUC = 89.9%, and remains robust under noise. Applied to a TCGA BRCA cohort, CTAS builds a directed TF$\to$RBP$\to$AS network and identifies HOXA3 (1.00), PRDM8 (0.86), and TWIST2 (0.83) as top TF controllers, alongside significant dynamic shifts in nine AS events detected by Wilcoxon test ($P < .05$). A focused CD44 subnetwork further highlights ZNF521 (0.86) and HIC1 (0.65) as candidate regulators. These findings demonstrate that CTAS transforms cross-sectional data into dynamic regulatory insights and yields experimentally testable TFs that control AS during EMT.
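The trajectory-reconstruction figure quoted in the CTAS abstract is a Spearman rank correlation between true and inferred orderings. As a minimal sketch of how such a score is computed, the plain-Python code below ranks both vectors (ties share the average rank) and takes the Pearson correlation of the ranks. The pseudotime values are made up, and the helper names are hypothetical; this is not the CTAS code.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Made-up simulation: true EMT progression vs. an inferred pseudotime
# in which two neighbouring samples are swapped.
true_time = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
inferred = [0.05, 0.15, 0.55, 0.50, 0.90, 0.95]
rho = spearman_rho(true_time, inferred)
print(round(rho, 4))
```

Because only ranks matter, the score is unaffected by any monotone rescaling of the inferred pseudotime, which suits methods that recover ordering rather than absolute time.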
{"title":"CTAS: a network control theory-based approach to identify key regulatory TFs of AS events during epithelial-mesenchymal transition.","authors":"Yan Gan, Yangsong He, Pu Zhao, Wai-Ki Ching, Yushan Qiu","doi":"10.1093/bib/bbag042","DOIUrl":"10.1093/bib/bbag042","url":null,"abstract":"<p><p>Alternative splicing (AS) is a key driver of transcriptomic diversity and plays a pivotal role in epithelial-mesenchymal transition (EMT). During EMT, dynamic splicing changes contribute to cell plasticity and metastasis, yet the upstream regulatory logic remains unclear. Although transcription factors (TFs) are thought to influence AS programs, they typically act through RNA-binding proteins (RBPs), forming a hierarchical TF$\\rightarrow$RBP$\\rightarrow$AS cascade. Current computational strategies struggle to recover such multi-layered regulation from bulk cross-sectional data, limiting our ability to identify TFs that ultimately control EMT-related AS events. To address this gap, we developed CTAS, a network control theory-based approach to identify key regulatory TFs of AS events during EMT. CTAS integrates pseudotime ordering, trend analysis, sparse directed network inference, and control-theoretic screening into a unified framework. In simulations, CTAS reconstructs EMT trajectories with Spearman's $\\rho = 0.99946$ and directed networks with ROC AUC = 89.9%, and remains robust under noise. Applied to a TCGA BRCA cohort, CTAS builds a directed TF$\\to$RBP$\\to$AS network and identifies HOXA3 (1.00), PRDM8 (0.86), and TWIST2 (0.83) as top TF controllers, alongside significant dynamic shifts in nine AS events detected by Wilcoxon test ($P < .05$). A focused CD44 subnetwork further highlights ZNF521 (0.86) and HIC1 (0.65) as candidate regulators. 
These findings demonstrate that CTAS transforms cross-sectional data into dynamic regulatory insights and yields experimentally testable TFs that control AS during EMT.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12888823/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146156224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cancer is a highly heterogeneous disease characterized by complex molecular changes. Subtypes identified through multi-omics data hold significant promise for improving prognosis and facilitating personalized precision treatment. Recent multi-omics integration methods have mostly focused on capturing complementary information from different data types, often overlooking potential interactions between omics data. Here, we develop a novel method named interactive multi-kernel learning (iMKL), which incorporates omics-omics interactions alongside heterogeneous data types under the unsupervised multi-kernel learning framework, to improve subtype identification. Using the sample-similarity kernel for each dataset, we propose a joint Hadamard product strategy to capture higher-order interactive effects from different omics data types. We applied iMKL to two renal cell carcinoma (RCC) datasets, clear cell renal cell carcinoma (ccRCC) and type II papillary renal cell carcinoma (type II pRCC), both including miRNA expression, mRNA expression, and DNA methylation data. Stability analysis through random sampling of patients or features demonstrated that iMKL exhibits strong robustness and accuracy in identifying patient subtypes. The identified subtypes revealed dramatic differences in patient survival, with both ccRCC and type II pRCC classified into three distinct subtypes. The findings in the real application highlight potential biomarkers associated with adverse patient outcomes and demonstrate substantial advancement in cancer subtype identification. The iMKL method effectively identifies tumor molecular subtypes that are strongly associated with clinical features and survival rates, providing valuable insights for accurate cancer subtyping, clinical decision-making, and the realization of personalized treatment strategies.
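The Hadamard product strategy mentioned above can be illustrated with a small sketch: the element-wise product of two sample-similarity kernels is large only where two patients are similar in both omics layers at once. The code below is a sketch under stated assumptions and not the iMKL implementation. It assumes Gaussian (RBF) kernels, a shared `gamma`, and uniform kernel weights, whereas the abstract does not specify the kernel type or how weights are learned. All function names and the toy data are hypothetical.

```python
from math import exp


def rbf_kernel(samples, gamma):
    """Sample-by-sample Gaussian similarity matrix for one omics layer."""
    n = len(samples)
    return [
        [
            exp(-gamma * sum((a - b) ** 2 for a, b in zip(samples[i], samples[j])))
            for j in range(n)
        ]
        for i in range(n)
    ]


def hadamard(k1, k2):
    """Element-wise product of two kernels (an omics-omics interaction kernel)."""
    return [[a * b for a, b in zip(row1, row2)] for row1, row2 in zip(k1, k2)]


def average_kernels(kernels):
    """Uniformly weighted combination; real MKL would learn these weights."""
    n = len(kernels[0])
    m = len(kernels)
    return [[sum(k[i][j] for k in kernels) / m for j in range(n)] for i in range(n)]


# Toy data: three patients, two features per omics layer (illustrative only).
mrna = [[0.0, 1.0], [0.1, 0.9], [2.0, 3.0]]
methylation = [[0.5, 0.5], [1.5, 0.4], [0.6, 0.5]]

k_mrna = rbf_kernel(mrna, gamma=0.5)
k_meth = rbf_kernel(methylation, gamma=0.5)
k_interaction = hadamard(k_mrna, k_meth)

# Combine the single-omics kernels with their interaction kernel.
k_fused = average_kernels([k_mrna, k_meth, k_interaction])
```

By the Schur product theorem, the element-wise product of two positive semidefinite matrices is itself positive semidefinite, so the interaction kernel remains a valid kernel and can be passed to any kernel-based clustering step.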
{"title":"Multi-omics data integration for enhanced cancer subtyping via interactive multi-kernel learning.","authors":"Hongyan Cao, Tong Wang, Zhaoyang Xu, Xin Zhao, Gaiqin Liu, Xiaoling Yang, Ruiling Fang, Yanhong Luo, Ping Zeng, Hongmei Yu, Yanbo Zhang, Yuehua Cui","doi":"10.1093/bib/bbaf687","DOIUrl":"10.1093/bib/bbaf687","url":null,"abstract":"<p><p>Cancer is a highly heterogeneous disease characterized by complex molecular changes. Subtypes identified through multi-omics data hold significant promise for improving prognosis and facilitating personalized precision treatment. Recent multi-omics integration methods have mostly focused on capturing complementary information from different data types, often overlooking potential interactions between omics data. Here, we develop a novel method named interactive multi-kernel learning (iMKL), which incorporates omics-omics interactions alongside heterogeneous data types under the unsupervised multi-kernel learning framework, to improve subtype identification. Using the sample-similarity kernel for each dataset, we propose a joint Hadamard product strategy to capture higher-order interactive effects from different omics data types. We applied iMKL to two renal cell carcinoma (RCC) datasets, clear cell renal cell carcinoma (ccRCC) and type II papillary renal cell carcinoma (type II pRCC), both including miRNA expression, mRNA expression, and DNA methylation data. Stability analysis through random sampling of patients or features demonstrated that iMKL exhibits strong robustness and accuracy in identifying patient subtypes. The identified subtypes revealed dramatic differences in patient survival, with both ccRCC and type II pRCC classified into three distinct subtypes. The findings in the real application highlight potential biomarkers associated with adverse patient outcomes and demonstrate substantial advancement in cancer subtype identification. 
The iMKL method effectively identifies tumor molecular subtypes that are strongly associated with clinical features and survival rates, providing valuable insights for accurate cancer subtyping, clinical decision-making, and the realization of personalized treatment strategies.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 6","pages":""},"PeriodicalIF":7.7,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12710476/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145773732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}