Wei Lan, Guohang He, Lingzhi Zhu, Ruiqing Zheng, Min Li, Yi Pan
Spatial transcriptomics (ST) offers unprecedented opportunities to decode the spatial organization of gene expression, yet the inherent noise and complexity of ST data pose substantial challenges for accurate analysis. Here, we present DACN, a unified framework that integrates an improved adversarial autoencoder (AAE) with a graph convolutional network (GCN) to robustly analyze ST data across varying resolutions and throughputs. DACN employs a hybrid encoder that couples multi-head attention with residual connections to capture fine-grained local expression patterns while retaining critical global information. The hybrid encoder and generator jointly construct the AAE module, which denoises expression profiles and learns stable latent representations. The GCN component further exploits spatial neighborhood relationships to refine these embeddings. Across multiple ST datasets with varying resolutions, DACN consistently outperforms existing methods in accuracy and robustness. All code and datasets are publicly available at https://github.com/lanbiolab/DACN.
{"title":"An unsupervised method for spatial transcriptomics analysis based on adversarial autoencoder.","authors":"Wei Lan, Guohang He, Lingzhi Zhu, Ruiqing Zheng, Min Li, Yi Pan","doi":"10.1093/bib/bbag070","DOIUrl":"10.1093/bib/bbag070","url":null,"abstract":"<p><p>Spatial transcriptomics (ST) offers unprecedented opportunities to decode the spatial organization of gene expression, yet the inherent noise and complexity of ST data pose substantial challenges for accurate analysis. Here, we present DACN, a unified framework that integrates an improved adversarial autoencoder (AAE) with a graph convolutional network (GCN) to robustly analyze ST data across varying resolutions and throughputs. DACN employs a hybrid encoder that couples multi-head attention with residual connections to capture fine-grained local expression patterns while retaining critical global information. The hybrid encoder and generator jointly construct the AAE module, which denoises expression profiles and learns stable latent representations. The GCN component further exploits spatial neighborhood relationships to refine these embeddings. Across multiple ST datasets with varying resolutions, DACN consistently outperforms existing methods in accuracy and robustness. All code and datasets are publicly available at https://github.com/lanbiolab/DACN.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12919444/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146225605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A comprehensive understanding of cancer progression requires integrating tissue morphology with spatial molecular profiles. We present SHEST, a multi-task profiling framework that leverages haematoxylin and eosin morphology to predict cellular composition and reconstruct spatial gene expression at single-cell resolution. SHEST employs a quadruple-tile input capturing nuclear and contextual information, combined with a neighbourhood-informed clustering algorithm to filter ambiguous cellular signals. It comprises a shared morphological encoder with two task-specific heads: a classifier for cell-type prediction and a reconstructor for gene expression. Multi-task optimization uses cross-entropy and zero-inflated negative binomial losses, specifically addressing the sparsity of spatial transcriptomic data. Evaluation on human lung adenocarcinoma datasets demonstrated high accuracy for the principal reciprocal constituents of the tumour-immune axis ($F_{1}$: 0.97 for tumour cells and 0.91 for lymphocytes). External validation confirmed its generalizability, revealing alveolar cells and their early neoplastic transitions. Reconstructed gene expression reproduced spatially resolved, cell-type-specific marker patterns (EPCAM in tumour cells, LTBP2 in fibroblasts, and CD3E in lymphocytes), recovering biologically coherent transcriptional architecture. SHEST also preserved distance-dependent spatial relationships and gene-level autocorrelation, reflecting the multicellular niche structure of the tumour microenvironment. By unifying cell-type identification, gene expression reconstruction, and spatial mapping within a single interpretable framework, SHEST provides a synergistic and cost-efficient bridge between histopathology and spatial transcriptomics. This approach facilitates comprehensive tissue characterization and forms a foundation for precision oncology through spatially informed, cell-level insights into tumour-immune ecosystems.
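For readers unfamiliar with the zero-inflated negative binomial (ZINB) loss mentioned in the SHEST abstract, the standard textbook form is written out below. The abstract does not say how the two losses are combined, so the weighting factor $\lambda$ and the index conventions (cell $i$, gene $g$) are assumptions made for illustration only.

```latex
% Standard ZINB likelihood for a count x with dropout probability \pi,
% mean \mu and dispersion \theta (textbook form, not taken from the paper).
\begin{align}
\mathrm{NB}(x;\mu,\theta) &=
  \frac{\Gamma(x+\theta)}{\Gamma(\theta)\,\Gamma(x+1)}
  \left(\frac{\theta}{\theta+\mu}\right)^{\theta}
  \left(\frac{\mu}{\theta+\mu}\right)^{x}, \\
P(x \mid \pi,\mu,\theta) &=
  \pi\,\mathbf{1}[x=0] + (1-\pi)\,\mathrm{NB}(x;\mu,\theta), \\
\mathcal{L}_{\mathrm{ZINB}} &=
  -\sum_{i}\sum_{g} \log P\!\left(x_{ig} \mid \pi_{ig},\mu_{ig},\theta_{g}\right), \\
% Assumed combination of the two task losses; \lambda is hypothetical.
\mathcal{L}_{\mathrm{total}} &=
  \mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{ZINB}}.
\end{align}
```

The extra point mass at zero is what lets the model absorb the many zero counts typical of sparse spatial transcriptomic data without distorting the estimated mean.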
{"title":"SHEST: single-cell-level artificial intelligence from haematoxylin and eosin morphology for cell-type prediction and spatial transcriptomics reconstruction.","authors":"Hoyeon Jeong, Junghan Oh, Donggeon Lee, Jae Hwan Kang, Yoon-La Choi","doi":"10.1093/bib/bbag037","DOIUrl":"10.1093/bib/bbag037","url":null,"abstract":"<p><p>A comprehensive understanding of cancer progression requires integrating tissue morphology with spatial molecular profiles. We present SHEST, a multi-task profiling framework that leverages haematoxylin and eosin morphology to predict cellular composition and reconstruct spatial gene expression at single-cell resolution. SHEST employs a quadruple-tile input capturing nuclear and contextual information, combined with a neighbourhood-informed clustering algorithm to filter ambiguous cellular signals. It comprises a shared morphological encoder with two task-specific heads: a classifier for cell-type prediction and a reconstructor for gene expression. Multi-task optimization uses cross-entropy and zero-inflated negative binomial losses, specifically addressing the sparsity of spatial transcriptomic data. Evaluation on human lung adenocarcinoma datasets demonstrated high accuracy for the principal reciprocal constituents of the tumour-immune axis ($F_{1}$: 0.97 for tumour cells and 0.91 for lymphocytes). External validation confirmed its generalizability, revealing alveolar cells and their early neoplastic transitions. Reconstructed gene expression reproduced spatially resolved, cell-type-specific marker patterns-EPCAM in tumour cells, LTBP2 in fibroblasts, and CD3E in lymphocytes-recovering biologically coherent transcriptional architecture. SHEST also preserved distance-dependent spatial relationships and gene-level autocorrelation, reflecting the multicellular niche structure of the tumour microenvironment. 
By unifying cell-type identification, gene expression reconstruction, and spatial mapping within a single interpretable framework, SHEST provides a synergistic and cost-efficient bridge between histopathology and spatial transcriptomics. This approach facilitates comprehensive tissue characterization and forms a foundation for precision oncology through spatially informed, cell-level insights into tumour-immune ecosystems.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12910627/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146212133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexander Partin, Priyanka Vasanthakumari, Oleksandr Narykov, Andreas Wilke, Natasha Koussa, Sara E Jones, Yitan Zhu, Jamie C Overbeek, Rajeev Jain, Gayara Demini Fernando, Cesar Sanchez-Villalobos, Cristina Garcia-Cardona, Jamaludin Mohd-Yusof, Nicholas Chia, Justin M Wozniak, Souparno Ghosh, Ranadip Pal, Thomas S Brettin, M Ryan Weil, Rick L Stevens
Deep learning and machine learning models have shown promise in drug response prediction (DRP), yet their ability to generalize across datasets remains an open question, raising concerns about their real-world applicability. Due to the lack of standardized benchmarking approaches, model evaluations and comparisons often rely on inconsistent datasets and evaluation criteria, making it difficult to assess true predictive capabilities. In this work, we introduce a benchmarking framework for evaluating cross-dataset prediction generalization in DRP models. Our framework incorporates five publicly available drug screening datasets, seven standardized DRP models, and a scalable workflow for systematic evaluation. To assess model generalization, we introduce a set of evaluation metrics that quantify both absolute performance (e.g. predictive accuracy across datasets) and relative performance (e.g. performance drop compared to within-dataset results), enabling a more comprehensive assessment of model transferability. Our results reveal substantial performance drops when models are tested on unseen datasets, underscoring the importance of rigorous generalization assessments. While several models demonstrate relatively strong cross-dataset generalization, no single model consistently outperforms across all datasets. Furthermore, we identify CTRPv2 as the most effective source dataset for training, yielding higher generalization scores across target datasets. By sharing this standardized evaluation framework with the community, our study aims to establish a rigorous foundation for model comparison, and accelerate the development of robust DRP models for real-world applications.
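The distinction between absolute and relative cross-dataset performance can be illustrated with a small sketch. The function name, the score layout, the placeholder dataset names `DatasetB` and `DatasetC`, and all numbers below are hypothetical and are not taken from the paper; the metric definitions are one plausible reading of the abstract, not the authors' exact formulas.

```python
from statistics import mean


def generalization_summary(scores, source):
    """Summarize how a model trained on `source` transfers to other datasets.

    `scores[source][target]` holds a higher-is-better metric (for example R^2).
    Assumes the within-dataset score is strictly positive.
    """
    within = scores[source][source]
    cross = {t: s for t, s in scores[source].items() if t != source}
    # Absolute performance: mean score over unseen target datasets.
    cross_mean = mean(cross.values())
    # Relative performance: fractional drop versus the within-dataset score.
    drops = {t: (within - s) / within for t, s in cross.items()}
    return {
        "within": within,
        "cross_mean": cross_mean,
        "mean_relative_drop": mean(drops.values()),
    }


# Toy scores for illustration only (not results from the study).
toy_scores = {"CTRPv2": {"CTRPv2": 0.8, "DatasetB": 0.6, "DatasetC": 0.4}}
summary = generalization_summary(toy_scores, "CTRPv2")
print(summary)
```

Reporting both numbers matters because a model can have a small relative drop simply because its within-dataset score was low to begin with.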
{"title":"Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis.","authors":"Alexander Partin, Priyanka Vasanthakumari, Oleksandr Narykov, Andreas Wilke, Natasha Koussa, Sara E Jones, Yitan Zhu, Jamie C Overbeek, Rajeev Jain, Gayara Demini Fernando, Cesar Sanchez-Villalobos, Cristina Garcia-Cardona, Jamaludin Mohd-Yusof, Nicholas Chia, Justin M Wozniak, Souparno Ghosh, Ranadip Pal, Thomas S Brettin, M Ryan Weil, Rick L Stevens","doi":"10.1093/bib/bbaf667","DOIUrl":"10.1093/bib/bbaf667","url":null,"abstract":"<p><p>Deep learning and machine learning models have shown promise in drug response prediction (DRP), yet their ability to generalize across datasets remains an open question, raising concerns about their real-world applicability. Due to the lack of standardized benchmarking approaches, model evaluations and comparisons often rely on inconsistent datasets and evaluation criteria, making it difficult to assess true predictive capabilities. In this work, we introduce a benchmarking framework for evaluating cross-dataset prediction generalization in DRP models. Our framework incorporates five publicly available drug screening datasets, seven standardized DRP models, and a scalable workflow for systematic evaluation. To assess model generalization, we introduce a set of evaluation metrics that quantify both absolute performance (e.g. predictive accuracy across datasets) and relative performance (e.g. performance drop compared to within-dataset results), enabling a more comprehensive assessment of model transferability. Our results reveal substantial performance drops when models are tested on unseen datasets, underscoring the importance of rigorous generalization assessments. While several models demonstrate relatively strong cross-dataset generalization, no single model consistently outperforms across all datasets. 
Furthermore, we identify CTRPv2 as the most effective source dataset for training, yielding higher generalization scores across target datasets. By sharing this standardized evaluation framework with the community, our study aims to establish a rigorous foundation for model comparison, and accelerate the development of robust DRP models for real-world applications.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12794626/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145958987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yan Wang, Lei Wang, Nan Sheng, Jie Hong, Yunzhi Liu, Pengze Wu, XinFei Wang, Shuyan Zhang, Chen Cao
Alternative polyadenylation (APA) of $3^{\prime}$ untranslated regions ($3^{\prime}$ UTRs) is a pervasive mechanism that regulates mRNA stability, localization, and translational efficiency by generating isoforms with distinct $3^{\prime}$ UTR lengths and regulatory element composition. Despite its critical role in fine-tuning gene expression, APA has been largely overlooked in transcriptome-wide association studies (TWAS), which traditionally rely on linear models of SNP effects. To bridge this gap, we developed ASTWAS, a two-stage framework that first trains APA usage prediction models (BLUP, Elastic Net, LASSO, and TOP1) to quantify SNP impacts on distal poly(A) site choice via the percentage of distal poly(A) site usage index, and then aggregates weighted SNP effects within a kernel method to capture both linear and nonlinear genetic interactions. In extensive simulations spanning additive, epistatic, heterogeneous, compensatory, and single-variant architectures under both pleiotropy and causality scenarios, ASTWAS shows higher statistical power than linear APA-TWAS ($3^{\prime}$aTWAS), especially at low heritability and in the presence of SNP interactions. Applied to WTCCC type 1 diabetes and rheumatoid arthritis cohorts, ASTWAS not only rediscovers known susceptibility genes but also suggests novel candidates (e.g. GABBR1, RGL2) that form coherent interaction modules and enrich immune-related pathways, underscoring the biological significance of our algorithm in complex trait genetics. ASTWAS is implemented in Python and freely available at https://github.com/wl-Simplecss/ASTWAS.
{"title":"ASTWAS: modeling alternative polyadenylation and SNP effects in kernel-driven TWAS reveal novel genetic associations for complex traits.","authors":"Yan Wang, Lei Wang, Nan Sheng, Jie Hong, Yunzhi Liu, Pengze Wu, XinFei Wang, Shuyan Zhang, Chen Cao","doi":"10.1093/bib/bbaf725","DOIUrl":"10.1093/bib/bbaf725","url":null,"abstract":"<p><p>Alternative polyadenylation (APA) of $3^{prime}$untranslated regions ($3^{prime}$UTRs) is a pervasive mechanism that regulates mRNA stability, localization, and translational efficiency by generating isoforms with distinct $3^{prime}$UTR lengths and regulatory element composition. Despite its critical role in fine-tuning gene expression, APA has been largely overlooked in transcriptome-wide association studies (TWAS), which traditionally rely on linear models of SNP effects. To bridge this gap, we developed ASTWAS, a two-stage framework that first trains APA usage prediction models (BLUP, Elastic Net, LASSO, and TOP1) to quantify SNP impacts on distal poly(A) site choice via the percentage of distal poly(A) site usage index, and then aggregates weighted SNP effects within a kernel method to capture both linear and nonlinear genetic interactions. In extensive simulations spanning additive, epistatic, heterogeneous, compensatory, and single-variant architectures under both pleiotropy and causality scenarios, ASTWAS shows higher statistical power than linear APA-TWAS ($3^{prime}$aTWAS), especially at low heritability and in the presence of SNP interactions. Applied to WTCCC type 1 diabetes and rheumatoid arthritis cohorts, ASTWAS not only rediscovers known susceptibility genes but also suggests novel candidates (e.g. GABBR1, RGL2) that form coherent interaction modules and enrich immune-related pathways, underscoring the biological significance of our algorithm in complex trait genetics. 
ASTWAS is implemented in Python and freely available at https://github.com/wl-Simplecss/ASTWAS.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12814985/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146003070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Protein language models (pLMs) have become essential tools in computational biology, powering diverse applications from variant effect prediction to protein engineering. Central to their success is the use of pretrained embeddings (contextualized representations of amino acid sequences), which enable effective transfer learning, especially in data-scarce settings. However, recent studies have revealed that standard masked language modeling objectives used to train these models often produce representations that are misaligned with the needs of downstream tasks. While scaling up model size improves performance in some cases, it does not universally yield better representations. In this study, we investigate two complementary strategies for improving pLM representations: (i) integrating text annotations through contrastive learning, and (ii) combining multiple embeddings via embedding fusion. We benchmark six text-integrated pLMs (tpLMs) and three large-scale pLMs across six biologically diverse tasks, showing that no single model dominates across settings. Fusion of multiple tpLM embeddings improves performance on most tasks but presents a computational bottleneck due to the combinatorial number of possible combinations. To overcome this, we propose greedier forward selection, a linear-time algorithm that efficiently identifies near-optimal embedding subsets. We validate its utility through two case studies, homologous sequence recovery and protein-protein interaction prediction, demonstrating new state-of-the-art results in both. Our work highlights embedding fusion as a practical and scalable strategy for improving protein representations.
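To make the idea of a linear-time subset search concrete, here is a minimal sketch of a single-pass greedy selection. The abstract does not describe the algorithm's steps, so this is only one plausible variant and should not be read as the authors' greedier forward selection. The function name, the `evaluate` callback, the embedding names `A`, `B`, `C`, and the toy coverage score are all hypothetical.

```python
def one_pass_forward_selection(candidates, evaluate):
    """Pick an embedding subset with a number of evaluations linear in len(candidates).

    `evaluate(subset)` takes a tuple of candidate names and returns a
    higher-is-better validation score for the fused embedding.
    """
    # Score each candidate alone once, then rank best-first.
    single = {c: evaluate((c,)) for c in candidates}
    ranked = sorted(candidates, key=lambda c: single[c], reverse=True)

    selected = [ranked[0]]
    best = single[ranked[0]]
    # Single pass: keep a candidate only if fusing it improves the score.
    for c in ranked[1:]:
        trial = evaluate(tuple(selected + [c]))
        if trial > best:
            selected.append(c)
            best = trial
    return selected, best


# Toy stand-in for a downstream validation score: the number of distinct
# "signals" covered by the chosen embeddings (illustrative only).
coverage = {"A": {1, 2, 3}, "B": {3, 4}, "C": {1, 2}}


def toy_evaluate(subset):
    return len(set().union(*(coverage[name] for name in subset)))


chosen, score = one_pass_forward_selection(["A", "B", "C"], toy_evaluate)
print(chosen, score)
```

With k candidates this calls `evaluate` 2k - 1 times, whereas exhaustive search over all subsets needs on the order of 2^k evaluations and classic forward selection needs on the order of k^2.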
{"title":"Scalable embedding fusion with protein language models: insights from benchmarking text-integrated representations.","authors":"Young Su Ko, Jonathan Parkinson, Wei Wang","doi":"10.1093/bib/bbag014","DOIUrl":"10.1093/bib/bbag014","url":null,"abstract":"<p><p>Protein language models (pLMs) have become essential tools in computational biology, powering diverse applications from variant effect prediction to protein engineering. Central to their success is the use of pretrained embeddings-contextualized representations of amino acid sequences-which enable effective transfer learning, especially in data-scarce settings. However, recent studies have revealed that standard masked language modeling objectives used to train these models often produce representations that are misaligned with the needs of downstream tasks. While scaling up model size improves performance in some cases, it does not universally yield better representations. In this study, we investigate two complementary strategies for improving pLM representations: (i) integrating text annotations through contrastive learning, and (ii) combining multiple embeddings via embedding fusion. We benchmark six text-integrated pLMs (tpLMs) and three large-scale pLMs across six biologically diverse tasks, showing that no single model dominates across settings. Fusion of multiple tpLMs embeddings improves performance on most tasks but presents a computational bottleneck due to the combinatorial number of possible combinations. To overcome this, we propose greedier forward selection, a linear-time algorithm that efficiently identifies near-optimal embedding subsets. We validate its utility through two case studies, homologous sequence recovery and protein-protein interaction prediction, demonstrating new state-of-the-art results in both. 
Our work highlights embedding fusion as a practical and scalable strategy for improving protein representations.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12853110/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146084340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Re: Qi et al. \"A roadmap for T cell receptor-peptide-MHC binding prediction by machine learning: glimpse and foresight\" (Briefings in Bioinformatics, 2025).","authors":"Cedric Ly, Stefan Bonn, Immo Prinz","doi":"10.1093/bib/bbag032","DOIUrl":"10.1093/bib/bbag032","url":null,"abstract":"","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12874877/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146123833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Messenger RNA (mRNA) vaccines have revolutionized vaccinology with their rapid development cycles and adaptability, yet their broad application is constrained by unresolved challenges in balancing mRNA structural stability and translational efficiency. Here, we introduce a groundbreaking multi-seed searching algorithm for mRNA codon optimization, an innovative framework that synergistically co-optimizes minimum free energy and codon adaptation index through adaptive integration of simulated annealing and genetic algorithms. This novel approach enhances global search capability to escape local optima, a critical limitation of existing tools. Evaluations across long therapeutic mRNA sequences and short peptides (neoantigens from bladder cancer and melanoma) reveal that our algorithm outperforms the state-of-the-art LinearDesign, delivering superior balanced improvements in both stability and translational efficiency and validating its unique ability to navigate the inherent trade-offs between these two key metrics. Built on this algorithm, the Optiseed platform introduces transformative features including customizable scoring functions, flexible parameters for tailored optimization, and support for integrating untranslated regions (UTRs), poly(A) tails, and other elements to enable end-to-end vaccine construct design. This innovation addresses the rigidity of conventional tools, empowering precise, context-specific optimization. Optiseed represents a robust, scalable solution for mRNA vaccine codon optimization. Its superior performance across diverse sequences underscores its potential to accelerate mRNA-based therapeutic development, particularly in personalized cancer immunotherapy, while offering a framework adaptable for other applications such as infectious disease vaccine design.
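One of the two quantities being co-optimized, the codon adaptation index (CAI), is simple enough to compute by hand. The sketch below uses the standard definition (geometric mean of per-codon relative adaptiveness) with a deliberately tiny, made-up usage table covering only alanine and lysine; the counts are illustrative, not a real organism's codon usage, and none of this reflects Optiseed's actual code. Minimum free energy is not shown because it requires an RNA folding library.

```python
import math

# Toy codon usage counts (illustrative numbers only, NOT a real usage table).
USAGE = {
    "GCU": 20, "GCC": 40, "GCA": 10, "GCG": 10,  # alanine
    "AAA": 25, "AAG": 50,                        # lysine
}
# Synonymous codon families covered by the toy table.
FAMILIES = [("GCU", "GCC", "GCA", "GCG"), ("AAA", "AAG")]


def relative_adaptiveness(usage, families):
    """w(codon) = usage of the codon / usage of the most used synonymous codon."""
    weights = {}
    for family in families:
        top = max(usage[c] for c in family)
        for codon in family:
            weights[codon] = usage[codon] / top
    return weights


def cai(sequence, weights):
    """Codon adaptation index: geometric mean of the weights along the sequence."""
    if len(sequence) == 0 or len(sequence) % 3 != 0:
        raise ValueError("sequence length must be a positive multiple of 3")
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    log_sum = sum(math.log(weights[c]) for c in codons)
    return math.exp(log_sum / len(codons))


W = relative_adaptiveness(USAGE, FAMILIES)
print(cai("GCCAAG", W))  # both codons are the most used in their family
print(cai("GCUAAA", W))  # both codons are used half as often as the top codon
```

Because always choosing the most frequent codon maximizes CAI but can change the predicted secondary structure, a search procedure has to trade this score off against minimum free energy rather than optimize either alone.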
{"title":"Multi-seed searching algorithm for integrated codon optimization of mRNA stability and translational efficiency in vaccine design.","authors":"Yuhan Bo, Bingxin Liu, Shengyu Huang, Yanwei Liu, Libin Deng, Dake Zhang, Jing Zhang","doi":"10.1093/bib/bbag047","DOIUrl":"10.1093/bib/bbag047","url":null,"abstract":"<p><p>Messenger RNA (mRNA) vaccines have revolutionized vaccinology with their rapid development cycles and adaptability, yet their broad application is constrained by unresolved challenges in balancing mRNA structural stability and translational efficiency. Here, we introduce a groundbreaking multi-seed searching algorithm for mRNA codon optimization, an innovative framework that synergistically co-optimizes minimum free energy and codon adaptation index through adaptive integration of simulated annealing and genetic algorithms. This novel approach enhances global search capability to escape local optima, a critical limitation of existing tools. Evaluations across long therapeutic mRNA sequences and short peptides (neoantigens from bladder cancer and melanoma) reveal our algorithm outperforms state-of-the-art LinearDesign, delivering superior balanced improvements in both stability and translational efficiency validating its unique ability to navigate the inherent trade-offs between these two key metrics. Built on this algorithm, the Optiseed platform introduces transformative features including customizable scoring functions, flexible parameters for tailored optimization, and support for integrating untranslated regions (UTRs), poly(A) tails, and other elements to enable end-to-end vaccine construct design. This innovation addresses the rigidity of conventional tools, empowering precise, context-specific optimization. Optiseed represents a robust, scalable solution for mRNA vaccine codon optimization. 
Its superior performance across diverse sequences underscores its potential to accelerate mRNA-based therapeutic development, particularly in personalized cancer immunotherapy, while offering a framework adaptable for other applications such as infectious disease vaccine design.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12885097/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146149172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chia-Chen Chu, Jhong-He Yu, Shang-Che Kuo, Fan-Wei Yang, Chia-Chang Lin, Chang-Hung Chen, Yi-Chen Wu, Cing Shih, Ying-Hsuan Sun, Te-Lun Mai, Ying-Lan Chen, Hsin-Hung Lin, Jung-Chen Su, Ying-Chung Jimmy Lin
NanoPrePro is a streamlined read preprocessor specifically designed for high precision in identifying full-length reads from Oxford Nanopore Technology (ONT) transcriptomic sequencing results, achieved through the precise identification of adapters/primers. However, the preprocessing of ONT reads has long been a neglected and ambiguous area that lacks thorough and systematic investigation. Here, we developed NanoPrePro, which outperformed the current best preprocessor, Pychopper, on simulated and real datasets. Using sequence similarity, adapter/primer location, and adapter/primer length, NanoPrePro applies a self-optimizing function to extract the best parameters for each sequencing file, allowing users to customize their analyses. Furthermore, NanoPrePro runs 38 times faster and uses less memory. NanoPrePro can be regarded as the state-of-the-art preprocessor with forward adaptability of ONT sequencing.
{"title":"NanoPrePro: a fully equipped, fast, and memory-efficient preprocessor for nanopore transcriptomic sequencing.","authors":"Chia-Chen Chu, Jhong-He Yu, Shang-Che Kuo, Fan-Wei Yang, Chia-Chang Lin, Chang-Hung Chen, Yi-Chen Wu, Cing Shih, Ying-Hsuan Sun, Te-Lun Mai, Ying-Lan Chen, Hsin-Hung Lin, Jung-Chen Su, Ying-Chung Jimmy Lin","doi":"10.1093/bib/bbag063","DOIUrl":"10.1093/bib/bbag063","url":null,"abstract":"<p><p>NanoPrePro is a streamlined read preprocessor specifically designed for high precision in identifying full-length reads from Oxford Nanopore Technology (ONT) transcriptomic sequencing results, achieved through the precise identification of adapters/primers. However, the preprocessing of ONT reads has been a long-term neglected and ambiguous area without thorough and systematic investigation. Here, we developed NanoPrePro that outperformed the current best preprocessor, Pychopper, using simulated and real datasets. Through sequence similarity, adapter/primer location, and adapter/primer length, NanoPrePro exerted a self-optimizing function to extract the best parameters in each sequencing file for users to customize their analyses. Furthermore, NanoPrePro shows a 38-times faster speed with less memory cost. 
NanoPrePro can be regarded as the state-of-the-art preprocessor with forward adaptability of ONT sequencing.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12903951/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146194110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xunuo Zhu, Wenyi Zhao, Siqi Wang, Jingwen Yang, Jingqi Zhou, Binbin Zhou, Ji Cao, Bo Yang, Zhan Zhou, Xun Gu
Cancer development is driven by somatic evolution and clonal selection. However, traditional selective pressure analysis methods have treated all sites within a gene equally; such a gene-level model oversimplifies the complexity of cancer evolution. In this study, we introduced CN/CS-calculator, a novel site-specific method that can capture selective pressures acting across different gene sites. By deciphering the interplay between the selection pattern and the function of a gene in oncogenesis, CN/CS-calculator uncovers a unique class of mini-driver genes, which exhibit weak positive selection, with certain critical sites providing context-dependent promoter effects on the fitness of cancer subclones while others are constrained by evolutionary conservation. Our method emphasizes the importance of site-specific analysis in uncovering how subtle evolutionary forces shape cancer biology. The refined understanding offers new insights into the mechanisms of cancer heterogeneity and molecular evolution, with potential implications for advancing therapeutic strategies and prognostic assessments.
{"title":"Identification of cancer mini-drivers by deciphering selective landscape in the cancer genome.","authors":"Xunuo Zhu, Wenyi Zhao, Siqi Wang, Jingwen Yang, Jingqi Zhou, Binbin Zhou, Ji Cao, Bo Yang, Zhan Zhou, Xun Gu","doi":"10.1093/bib/bbaf694","journal":{"name":"Briefings in bioinformatics","volume":"27 1"},"publicationDate":"2026-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12784965/pdf/"}
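The contrast between a gene-level and a site-level view of selection can be illustrated with a toy calculation. The sketch below is not the CN/CS-calculator algorithm, which the abstract does not specify; it only assumes a generic observed/expected ratio of mutation counts, and all function names, counts, and the hotspot threshold are hypothetical.

```python
from typing import List


def observed_expected_ratio(observed: float, expected: float) -> float:
    """Ratio of observed to expected mutation counts (values above 1 suggest an excess)."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


def gene_level_ratio(observed: List[float], expected: List[float]) -> float:
    """Pool all sites of a gene into a single ratio, as a gene-level model would."""
    return observed_expected_ratio(sum(observed), sum(expected))


def site_level_ratios(observed: List[float], expected: List[float]) -> List[float]:
    """Compute one ratio per site so that individual sites can stand out."""
    return [observed_expected_ratio(o, e) for o, e in zip(observed, expected)]


# Hypothetical counts for an 8-site gene (illustrative values, not real data).
observed = [0, 1, 12, 0, 1, 0, 1, 1]
expected = [2.0] * 8

gene_ratio = gene_level_ratio(observed, expected)
site_ratios = site_level_ratios(observed, expected)

# Hypothetical threshold: flag sites with more than twice the expected count.
hotspots = [i for i, ratio in enumerate(site_ratios) if ratio > 2.0]

print(f"gene-level ratio: {gene_ratio:.2f}")
print(f"site-level ratios: {site_ratios}")
print(f"candidate hotspot sites: {hotspots}")
```

In this toy gene the pooled ratio looks neutral because one site with a strong excess is averaged against several depleted sites, which is the kind of signal a gene-level model can miss.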
Identifying transcription factors (TFs) responsible for gene expression changes remains a central challenge in functional genomics. TFEA.ChIP is a ChIP-seq-based TF enrichment analysis tool that addresses this by linking TF binding profiles to differentially expressed genes through experimentally supported cis-regulatory element (CRE)-gene associations. Unlike motif- or heuristic-based approaches, TFEA.ChIP adopts a biologically grounded strategy by intersecting TF binding data from ReMap2022 with regulatory maps from ENCODE's rE2G and CREdb. To overcome the high context-specificity of rE2G associations, we developed filtering strategies based on confidence scores and recurrence across biosamples. Benchmarking on 342 curated gene sets from the Molecular Signatures Database C2 CGP collection showed that recurrence-based filtering significantly improved accuracy, outperforming the original GeneHancer-based implementation and leading tools including BARTv2.0, Lisa, ChEA3, and HOMER. A case study on hypoxia further validated the method, demonstrating accurate and pathway-specific enrichment of hypoxia-inducible factor-related TFs using both overrepresentation analysis and gene set enrichment analysis. Additionally, the updated implementation of TFEA.ChIP in R/Bioconductor introduces several user-friendly features, including automated analysis workflows and expression-based filtering of candidate TFs. These additions streamline the integration of TFEA.ChIP into standard RNA-seq analysis pipelines, enabling more efficient and reproducible workflows. Together with its strong benchmarking performance and biologically grounded framework, the updated tool provides a robust and accessible solution for inferring transcriptional regulators from gene expression data.
{"title":"Enhancing TFEA.ChIP with ENCODE regulatory maps for generalizable transcription factor enrichment.","authors":"Yosra Berrouayel, Luis Del Peso","doi":"10.1093/bib/bbaf715","journal":{"name":"Briefings in bioinformatics","volume":"27 1"},"publicationDate":"2026-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12796816/pdf/"}
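The two ideas named in the abstract, recurrence-based filtering of CRE-gene associations and overrepresentation analysis, can be sketched generically as follows. This is a minimal Python illustration under stated assumptions, not the TFEA.ChIP implementation (which is an R/Bioconductor package) and not its API. The data structures, function names, the `min_biosamples` threshold, and the toy identifiers are all hypothetical, and the enrichment test shown is a plain one-sided hypergeometric test.

```python
from math import comb
from typing import Dict, Set, Tuple

Association = Tuple[str, str]  # (CRE identifier, gene symbol); hypothetical representation


def filter_by_recurrence(
    assoc_biosamples: Dict[Association, Set[str]], min_biosamples: int
) -> Set[Association]:
    """Keep CRE-gene associations observed in at least `min_biosamples` biosamples."""
    return {
        pair
        for pair, biosamples in assoc_biosamples.items()
        if len(biosamples) >= min_biosamples
    }


def tf_target_genes(tf_bound_cres: Set[str], associations: Set[Association]) -> Set[str]:
    """Genes linked to any CRE that overlaps a binding site of the TF."""
    return {gene for cre, gene in associations if cre in tf_bound_cres}


def hypergeom_upper_tail(k: int, M: int, n: int, N: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(population M, n marked items, N draws)."""
    upper = min(n, N)
    favourable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favourable / comb(M, N)


# Toy inputs (hypothetical identifiers, not real annotations).
assoc_biosamples = {
    ("cre1", "GENE_A"): {"s1", "s2", "s3"},
    ("cre2", "GENE_B"): {"s1", "s2"},
    ("cre3", "GENE_C"): {"s1"},  # seen in one biosample only, so it is filtered out
    ("cre4", "GENE_D"): {"s2", "s3"},
}
background = {f"GENE_{letter}" for letter in "ABCDEFGHIJ"}
de_genes = {"GENE_A", "GENE_B", "GENE_E"}
tf_bound_cres = {"cre1", "cre2", "cre3"}

recurrent = filter_by_recurrence(assoc_biosamples, min_biosamples=2)
targets = tf_target_genes(tf_bound_cres, recurrent) & background

# Overrepresentation: are the TF targets enriched among the DE genes?
p_value = hypergeom_upper_tail(
    k=len(targets & de_genes), M=len(background), n=len(targets), N=len(de_genes)
)
print(f"targets after filtering: {sorted(targets)}")
print(f"one-sided enrichment p-value: {p_value:.4f}")
```

Raising `min_biosamples` trades sensitivity for generality: associations that appear in only one biosample are more likely to be context-specific, so removing them leaves a smaller but more transferable target set for the enrichment test.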