Pub Date: 2025-12-31; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf199
Ragini Mishra, Nahid Akhtar, Jorge Samuel Leon Magdeleno, Abdul Rajjak Shaikh, Manik Prabhu Narsing Rao, Neeta Raj Sharma, Luigi Cavallo, Mohit Chawla
Pneumocystis jirovecii poses a significant threat to immunocompromised individuals, necessitating the development of an effective vaccine. This study employs an immunoinformatics approach to design a promising vaccine candidate against P. jirovecii. Utilizing various computational tools, the study identified potential antigenic epitopes capable of eliciting immune responses within the P. jirovecii major surface glycoprotein C. The chosen epitopes were evaluated using computational tools for their allergenicity, interferon-γ and interleukin activation ability, and toxicity, ensuring the selection of immunogenic and safe candidates. These analyses led to the selection of 10 epitopes, which were then linked with adjuvants to model a potential vaccine candidate. Molecular docking and molecular dynamics simulations were performed in a solvent environment to investigate the binding interactions between the vaccine candidate and toll-like receptors, along with calculations of thermodynamic properties. Finally, in silico immune simulations were performed to analyze the immunogenic potential of the vaccine candidate. Future prospects include in vitro and in vivo validation of the vaccine candidate and the exploration of novel adjuvants to enhance its immunogenicity. This study contributes to the ongoing efforts to develop a preventive solution against P. jirovecii infections, addressing a critical gap in the protection of immunocompromised individuals.
{"title":"Development of a vaccine construct against <i>Pneumocystis jirovecii</i> pneumonia using computational tools.","authors":"Ragini Mishra, Nahid Akhtar, Jorge Samuel Leon Magdeleno, Abdul Rajjak Shaikh, Manik Prabhu Narsing Rao, Neeta Raj Sharma, Luigi Cavallo, Mohit Chawla","doi":"10.1093/nargab/lqaf199","DOIUrl":"10.1093/nargab/lqaf199","url":null,"abstract":"<p><p><i>Pneumocystis jirovecii</i> poses a significant threat to immunocompromised individuals, necessitating the development of an effective vaccine. This study employs an immunoinformatics approach to design a promising vaccine candidate against <i>P. jirovecii</i>. Utilizing various computational tools, the study identified potential antigenic epitopes capable of eliciting immune responses within the <i>P. jirovecii</i> major surface glycoprotein C. The chosen epitopes were evaluated using computational tools for their allergenicity, interferon-γ and interleukin activation ability, and toxicity, ensuring the selection of immunogenic and safe candidates. These analyses led to the selection of 10 epitopes, which were then linked with adjuvants to model a potential vaccine candidate. Molecular docking and molecular dynamics simulations were performed in a solvent environment to investigate the binding interactions between the vaccine candidate and toll-like receptors, along with calculations of thermodynamic properties. Finally, <i>in silico</i> immune simulations were performed to analyze the immunogenic potential of the vaccine candidate. Future prospects include <i>in vitro</i> and <i>in vivo</i> validation of the vaccine candidate and the exploration of novel adjuvants to enhance its immunogenicity. This study contributes to the ongoing efforts to develop a preventive solution against <i>P. 
jirovecii</i> infections, addressing a critical gap in the protection of immunocompromised individuals.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf199"},"PeriodicalIF":2.8,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754782/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145890500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-31; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf201
Yaqi A Deng, Torgny Karlsson, Åsa Johansson
Advances in high-throughput technologies enable large-scale studies on genomics and molecular phenotypes. However, the trade-off between quality and quantity reduces assay sensitivity, and several measurements in large-scale proteomics and metabolomics analytes fall below the limit of detection (LOD). If not properly addressed, this may introduce bias in effect estimates. To address this, we conducted a simulation study to evaluate the performance of linear, Tobit, Cox, and logistic modeling in the presence of below-LOD measurements in genome-wide association studies. We identified the optimal strategy as a two-step Linear-Tobit scheme, including rapid screening with linear regression followed by refinement with Tobit regression to retrieve accurate effect estimates. This higher accuracy helps mitigate a 1.3-fold and 2.7-fold inflation in causal estimates in a Mendelian randomization (MR) study, which would otherwise be present with 50% and 90% values below LOD. Validation through case studies on estradiol and testosterone levels in the UK Biobank confirmed the simulation results across subgroups with varying proportions of below-LOD measurements. The Linear-Tobit scheme offers optimal detection power and efficiency, with a focus on its applicability to biobank-scale datasets and accuracy in effect estimates to mitigate bias in downstream applications such as MR and polygenic risk scores.
{"title":"Improving accuracy in genome-wide association studies: a two-step approach for handling below limit of detection biomarker measurements.","authors":"Yaqi A Deng, Torgny Karlsson, Åsa Johansson","doi":"10.1093/nargab/lqaf201","DOIUrl":"10.1093/nargab/lqaf201","url":null,"abstract":"<p><p>Advances in high-throughput technologies enable large-scale studies on genomics and molecular phenotypes. However, the trade-off between quality and quantity reduces assay sensitivity, and several measurements in large-scale proteomics and metabolomics analytes fall below the limit of detection (LOD). If not properly addressed, this may introduce bias in effect estimates. To address this, we conducted a simulation study to evaluate the performance of linear, Tobit, Cox, and logistic modeling in the presence of below-LOD measurements in genome-wide association studies. We identified the optimal strategy as a two-step Linear-Tobit scheme, including rapid screening with linear regression followed by refinement with Tobit regression to retrieve accurate effect estimates. This higher accuracy helps mitigate a 1.3-fold and 2.7-fold inflation in causal estimates in a Mendelian randomization (MR) study, which would otherwise be present with 50% and 90% values below LOD. Validation through case studies on estradiol and testosterone levels in the UK Biobank confirmed the simulation results across subgroups with varying proportions of below-LOD measurements. 
The Linear-Tobit scheme offers optimal detection power and efficiency, with a focus on its applicability to biobank-scale datasets and accuracy in effect estimates to mitigate bias in downstream applications such as MR and polygenic risk scores.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf201"},"PeriodicalIF":2.8,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754788/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145890488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
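The two-step Linear-Tobit idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the simulated data, the function names (`simulate`, `linear_screen`, `tobit_refine`, `two_step`), the screening threshold `t_threshold`, and the choice to substitute the LOD for below-LOD values in the linear step are all assumptions made for illustration. The Tobit step is a plain left-censored Gaussian likelihood maximized with SciPy.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def simulate(n=4000, beta=0.3, frac_below=0.5, seed=0):
    """Simulate one variant and a biomarker left-censored at the LOD (toy data)."""
    rng = np.random.default_rng(seed)
    g = rng.binomial(2, 0.3, size=n).astype(float)  # genotype dosage 0/1/2
    y = beta * g + rng.normal(size=n)               # true biomarker level
    lod = np.quantile(y, frac_below)                # LOD set so frac_below is censored
    censored = y < lod
    return g, np.where(censored, lod, y), censored, lod


def linear_screen(g, y):
    """Step 1: ordinary least squares on LOD-substituted values (fast screen)."""
    X = np.column_stack([np.ones_like(g), g])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return coef[1], coef[1] / se                    # slope and its t statistic


def tobit_refine(g, y, censored, lod):
    """Step 2: left-censored Gaussian (Tobit) maximum likelihood for the slope."""
    X = np.column_stack([np.ones_like(g), g])

    def neg_loglik(params):
        mu = X @ params[:2]
        sigma = np.exp(params[2])                   # log-parameterized to stay positive
        ll_obs = norm.logpdf(y[~censored], loc=mu[~censored], scale=sigma)
        ll_cen = norm.logcdf((lod - mu[censored]) / sigma)
        return -(ll_obs.sum() + ll_cen.sum())

    start = np.array([y.mean(), 0.0, np.log(y.std())])
    return minimize(neg_loglik, start, method="BFGS").x[1]


def two_step(g, y, censored, lod, t_threshold=5.0):
    """Screen with linear regression; refine with Tobit only if the screen passes."""
    beta_lin, t_stat = linear_screen(g, y)
    beta_tobit = tobit_refine(g, y, censored, lod) if abs(t_stat) >= t_threshold else None
    return beta_lin, beta_tobit


g, y, censored, lod = simulate()
beta_lin, beta_tobit = two_step(g, y, censored, lod)
print("linear estimate:", beta_lin, "Tobit estimate:", beta_tobit)
```

In this toy setting half of the values sit at the LOD, so the linear slope is attenuated while the Tobit slope should land closer to the simulated effect of 0.3. Running Tobit only on variants that pass the cheap screen is what keeps the scheme practical at genome-wide scale.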
Pub Date: 2025-12-31; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf203
Elio Nushi, François P Douillard, Katja Selby, Benjamin A Blount, Oliver J Pennington, Nigel P Minton, Miia Lindström, Antti Honkela
Transcriptomics experiments are often conducted to capture changes in gene expression over time. However, time annotations may be missing or imprecise, or may not reflect the same physiological state of the bacterial culture across experiments. Assigning accurate time points to these experiments using a reference time course is therefore crucial for identifying differentially expressed genes and for understanding the gene regulatory networks that underlie the studied organism's physiology and life cycle. This important task, which could enhance the biological interpretation of transcriptomics experiments, has not been previously addressed. In this work, we propose a novel method to solve the challenge of realigning transcriptomics experiments based on a reference time course. Our method is based on a Bayesian approach that uses Gaussian process regression modeling. We show a use case in which our method assigns time annotations to legacy microarray samples of the bacterium Clostridium botulinum, which were annotated solely by the growth phase at the time the culture aliquots were sampled, using recently collected RNA-Seq time series data comprising multiple replicates as a reference. The method significantly improved the description of the growth phases of the microarray data compared to the original annotations by clearly delineating the microarray samples belonging to different growth phases, as demonstrated by principal component analysis. Consequently, a larger number of differentially expressed genes was detected when comparing experiments belonging to successive growth phases. We compare this approach with a baseline method that uses a k-nearest neighbor algorithm and show that our method offers a higher resolution in the description of the data by exposing smaller time changes between samples. We also test the performance of the method on sparse RNA-Seq time series (i.e. sampled every second hour). All the predictions for the samples were within a 30-min margin of their true time.
{"title":"A supervised Bayesian method for time (re)annotation of transcriptomics data.","authors":"Elio Nushi, François P Douillard, Katja Selby, Benjamin A Blount, Oliver J Pennington, Nigel P Minton, Miia Lindström, Antti Honkela","doi":"10.1093/nargab/lqaf203","DOIUrl":"10.1093/nargab/lqaf203","url":null,"abstract":"<p><p>Transcriptomics experiments are often conducted to capture changes in gene expression over time. However, time annotations may be missing, imprecise, or not reflect the same physiological state of the bacterial culture between different experiments. Assigning accurate time points to these experiments using a reference time course is therefore crucial for identifying differentially expressed genes, and understanding gene regulatory networks for elucidating the studied organism's physiology and life cycle. This important task, which could enhance the biological interpretation of the transcriptomics experiments, has not been previously addressed. In this work, we propose a novel method to solve the challenge of realigning transcriptomics experiments based on a reference time course. Our method is based on a Bayesian approach that uses Gaussian process regression modeling. We show a use case of applying our method for assigning time annotations in legacy microarray samples of the bacterium <i>Clostridium botulinum</i>, which were solely annotated based on the growth phase at the time when the culture aliquots were sampled, utilizing recently collected RNA-Seq time series data comprising multiple replicates as a reference. The method significantly improved the description of the growth phases of the microarray data compared to the original annotations by clearly delineating the microarray samples belonging to different growth phases, as demonstrated by principal component analysis. Consequently, a larger number of differentially expressed genes was detected when comparing experiments belonging to successive growth phases. 
We compare this innovative approach with a baseline method that uses k-nearest neighbor algorithm and show that our method offers a higher resolution in the description of the data by exposing smaller time changes between samples. We also test the performance of the method on sparse RNA-Seq time series (i.e. sampled every second hour). All the predictions for the samples were within a 30-min margin of their true time.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf203"},"PeriodicalIF":2.8,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754789/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145890457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
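The general idea of time (re)annotation with Gaussian process regression, as described in the abstract above, can be sketched as follows. This is an illustrative toy, not the published method: the synthetic sinusoidal genes, the fixed kernel hyperparameters (`LENGTH_SCALE`, `SIGNAL_VAR`, `NOISE_VAR`), the flat prior over time, and all function names are assumptions. A GP is fitted per gene on a sparse reference time course, and an unannotated sample is assigned the time that maximizes its likelihood under the GP predictions.

```python
import numpy as np

LENGTH_SCALE, SIGNAL_VAR, NOISE_VAR = 2.0, 1.0, 0.02  # assumed, not learned


def rbf_kernel(a, b):
    """Squared-exponential kernel between two 1-D arrays of time points."""
    diff = a[:, None] - b[None, :]
    return SIGNAL_VAR * np.exp(-0.5 * (diff / LENGTH_SCALE) ** 2)


def gp_reference_curves(t_ref, Y_ref, t_grid):
    """GP predictive mean (grid x genes) and variance (grid) from reference data."""
    K = rbf_kernel(t_ref, t_ref) + NOISE_VAR * np.eye(len(t_ref))
    K_star = rbf_kernel(t_grid, t_ref)
    mean = K_star @ np.linalg.solve(K, Y_ref)
    reduction = np.einsum("ij,ji->i", K_star, np.linalg.solve(K, K_star.T))
    var = np.maximum(SIGNAL_VAR - reduction, 0.0) + NOISE_VAR
    return mean, var


def time_posterior(x, mean, var):
    """Posterior over grid times for one sample, assuming a flat prior on time."""
    sq = (x[None, :] - mean) ** 2 / var[:, None]
    loglik = -0.5 * (sq + np.log(2 * np.pi * var[:, None])).sum(axis=1)
    weights = np.exp(loglik - loglik.max())
    return weights / weights.sum()


# Synthetic reference: 50 genes with smooth profiles, sampled every second hour.
rng = np.random.default_rng(1)
phases = rng.uniform(0, 2 * np.pi, size=50)


def curves(t):
    return np.sin(2 * np.pi * t[:, None] / 20.0 + phases[None, :])


t_ref = np.arange(0, 11, 2.0)
Y_ref = curves(t_ref) + rng.normal(scale=np.sqrt(NOISE_VAR), size=(len(t_ref), 50))
gene_mean = Y_ref.mean(axis=0)                 # center genes for the zero-mean GP

# One unannotated sample taken at a time that is absent from the reference.
true_time = 5.0
x = curves(np.array([true_time]))[0] + rng.normal(scale=np.sqrt(NOISE_VAR), size=50)

t_grid = np.linspace(0, 10, 201)
mean, var = gp_reference_curves(t_ref, Y_ref - gene_mean, t_grid)
posterior = time_posterior(x - gene_mean, mean, var)
t_hat = t_grid[np.argmax(posterior)]
print("true time:", true_time, "assigned time:", t_hat)
```

Because the output is a full posterior over a time grid rather than a vote among the nearest reference samples, the assigned time can fall between reference time points, which is the kind of finer resolution the abstract contrasts with a k-nearest neighbor baseline.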
Pub Date: 2025-12-31; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf205
Boqi Wang, Jiayi Wang, Ammar Aleem Rashied, Bo Meng, Jesse Zhang, Jun S Liu, Jie Jiang, Zhaohui S Qin
Accurate identification of the tissues affected by human diseases is important for understanding disease etiology and developing new treatment strategies. In this study, we develop a logistic regression-based method named DEDUCE (disease tissue detection using logistic regression) that combines genomics big data and machine learning to address this important problem. The central hypothesis is that most disease-associated genes are expressed specifically in affected tissues. DEDUCE takes advantage of newly emerged data on disease-related genes as well as tissue-specific gene expression data. The unique feature of DEDUCE is that it takes into account the strength of gene-disease associations. When we applied DEDUCE to a total of 3,261,324 gene-disease associations collected from DisGeNET covering 30,170 diseases and 21,666 genes, we identified 216 significant tissue-disease pairs composed of 120 unique diseases and 37 unique tissues. Many of them shed light on potential explanations for disease pathogenesis. The results showed great consistency with previous findings and were supported by empirical plots and gene set enrichment analysis. Overall, DEDUCE has shown great potential in uncovering novel pathogenesis mechanisms of complex diseases. In-depth analysis and experimental validation are required to fully understand these discovered tissue-trait associations and their enriched genes.
{"title":"DEDUCE: statistical inference on disease-associated genes uncovers tissue-disease associations.","authors":"Boqi Wang, Jiayi Wang, Ammar Aleem Rashied, Bo Meng, Jesse Zhang, Jun S Liu, Jie Jiang, Zhaohui S Qin","doi":"10.1093/nargab/lqaf205","DOIUrl":"10.1093/nargab/lqaf205","url":null,"abstract":"<p><p>Accurate identification of affected tissues of human diseases is important for the derivation of disease etiology and the development of new treatment strategies. In this study, we develop a logistic regression-based method named DEDUCE (disease tissue detection using logistic regression) that combines genomics big data and machine learning to address this important problem. The central hypothesis is that most disease-associated genes are expressed specifically in affected tissues. DEDUCE takes advantage of newly emerged data on disease-related genes as well as tissue-specific gene expression data. The unique feature of DEDUCE is that it takes into account the strength of gene-disease associations. When we applied DEDUCE to a total of 3261, 324 gene-disease associations collected from DisGeNET covering 30,170 diseases and 21,666 genes, we identified 216 significant tissue-disease pairs composed of 120 unique diseases and 37 unique tissues. Many of them shed light on potential explanations for disease pathogenesis. The results showed great consistency with previous findings and were proven effective by empirical plots and gene set enrichment analysis. Overall, DEDUCE has shown great potential in uncovering novel pathogenesis mechanisms of complex diseases. 
In-depth analysis and experimental validation were required to fully understand these discovered tissue-trait associations and their enriched genes.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf205"},"PeriodicalIF":2.8,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754781/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145890510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
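The abstract above does not say how DEDUCE incorporates association strength, so the following is only one plausible reading, sketched on simulated data: a logistic regression of disease-gene membership on a tissue-specificity score, in which each disease gene is weighted by its association score. The variable names, the weighting scheme, and the simulated effect size are all assumptions, and the fit uses a hand-written Newton (IRLS) loop rather than any particular library.

```python
import numpy as np


def weighted_logistic_fit(x, y, w, n_iter=25):
    """Weighted logistic regression of y on x (with intercept) via Newton steps."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))                       # weighted score
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X  # weighted information
        beta = beta + np.linalg.solve(hess, grad)
    return beta


# Toy data: one tissue, 2000 genes (all values simulated, not from DisGeNET).
rng = np.random.default_rng(7)
specificity = rng.normal(size=2000)                      # tissue-specificity score per gene
prob = 1.0 / (1.0 + np.exp(-(-2.0 + 1.0 * specificity)))
is_disease_gene = (rng.uniform(size=2000) < prob).astype(float)

# Hypothetical association strength in (0.1, 1] for disease genes; other genes get weight 1.
strength = rng.uniform(0.1, 1.0, size=2000)
weights = np.where(is_disease_gene == 1.0, strength, 1.0)

intercept, slope = weighted_logistic_fit(specificity, is_disease_gene, weights)
print("slope for tissue specificity:", slope)
```

A clearly positive slope would suggest that the disease's genes are enriched among genes specific to this tissue. Repeating the fit over every tissue-disease pair and correcting for multiple testing would give a list of candidate pairs, though significance testing with non-frequency weights needs more care than this sketch provides.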
Pub Date: 2025-12-31; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf210
Feng Zhang, Heqin Zhu, Jiayin Gao, Jie Hu, Ke Chen, Shaohua Kevin Zhou, Peng Xiong
The internal ribosome entry site (IRES) is a special type of RNA cis-acting element that can initiate translation independently of the 5' cap structure and is widely found in viral RNAs and eukaryotic messenger RNAs. In recent years, an increasing number of studies have revealed that IRES elements also exist in circular RNAs (circRNAs) and mediate their translation. CircRNAs exhibit high stability and tissue specificity, playing critical roles in various physiological and pathological processes. Their coding potential provides important clues for the discovery of novel functional proteins. However, due to the nonlinear structure of circRNAs and the complexity of IRES-mediated regulatory mechanisms, accurately identifying IRES elements within circRNAs remains a significant challenge. Here, we propose IRESeek, a dual-branch deep learning framework for highly accurate detection of IRES elements in circRNAs, which uses a transformer for RNA sequence modeling and a graph convolutional network for RNA structural guidance. To capture the structural patterns of circRNAs, IRESeek uses physics-based thermodynamic features of RNA secondary structure, namely the base pair motif energy and the base pair probability, as structural guidance that is combined with the RNA sequence, enabling comprehensive joint learning of RNA sequence and base pair interactions.
{"title":"IRESeek: structure-informed deep learning method for accurate identification of internal ribosome entry sites in circular RNAs.","authors":"Feng Zhang, Heqin Zhu, Jiayin Gao, Jie Hu, Ke Chen, Shaohua Kevin Zhou, Peng Xiong","doi":"10.1093/nargab/lqaf210","DOIUrl":"10.1093/nargab/lqaf210","url":null,"abstract":"<p><p>The internal ribosome entry site (IRES) is a special type of RNA <i>cis</i>-acting element that can initiate translation independently of the 5' cap structure and is widely found in viral RNAs and eukaryotic messenger RNAs. In recent years, an increasing number of studies have revealed that IRES elements also exist in circular RNAs (circRNAs) and mediate their translation. CircRNAs exhibit high stability and tissue specificity, playing critical roles in various physiological and pathological processes. Their coding potential provides important clues for the discovery of novel functional proteins. However, due to the nonlinear structure of circRNAs and the complexity of IRES-mediated regulatory mechanisms, accurately identifying IRES elements within circRNAs remains a significant challenge. Here, we propose IRESeek, a dual-branch deep learning framework for highly accurate detection of IRES elements in circRNA, which utilizes transformer for RNA sequence modeling and graph convolutional network for RNA structural guidance. 
To grasp the structural patterns of circRNAs, IRESeek employs physical-based thermodynamic energy of RNA secondary structure-base pair motif energy and the base pair probability as guidance structural characteristics to incorporate with RNA sequence, enabling comprehensive joint learning of RNA sequence and base pair interactions.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf210"},"PeriodicalIF":2.8,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754787/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145889649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-31; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf208
Rongxin Zhang, Jean-Louis Mergny
The precise regulation of gene transcription relies on promoters, and the selection of specific promoters for a particular gene is a key determinant of transcript diversity. However, the regulatory mechanisms governing promoter selection are not fully understood. G-quadruplexes (G4s) are unique DNA noncanonical secondary structures that have emerged as important regulators of gene expression. In this study, we systematically analyzed the relationship between G4 structures and alternative promoters (APs) in two cancer cell lines, K562 and HepG2, by integrating native elongating transcript-cap analysis of gene expression and G4 ChIP-seq datasets. We identified 573 differentially utilized APs (|fold change| > 2, false discovery rate < 0.05), 26% of which were associated with G4 structures within 100 base pairs. Notably, G4-associated promoters predominantly exhibited increased activity, suggesting that G4s generally promote AP selection. Furthermore, treatment with G4 ligands induced the generation of APs, suggesting that the stabilization of G4 structures may modulate AP usage. Collectively, these findings provide new insights into the G4-based mechanisms that regulate transcript isoform diversity.
{"title":"G-quadruplex structures as modulators of alternative promoter usage.","authors":"Rongxin Zhang, Jean-Louis Mergny","doi":"10.1093/nargab/lqaf208","DOIUrl":"10.1093/nargab/lqaf208","url":null,"abstract":"<p><p>The precise regulation of gene transcription relies on promoters, and the selection of specific promoters for a particular gene is a key determinant of transcript diversity. However, the regulatory mechanisms governing promoter selection are not fully understood. G-quadruplexes (G4s) are unique DNA noncanonical secondary structures that have emerged as important regulators of gene expression. In this study, we systematically analyzed the relationship between G4 structures and alternative promoters (APs) in two cancer cell lines, K562 and HepG2, by integrating native elongating transcript-cap analysis of gene expression and G4 ChIP-seq datasets. We identified 573 differentially utilized APs (|fold change| > 2, false discovery rate < 0.05), 26% of which being associated with G4 structures within 100 base pairs. Notably, G4-associated promoters predominantly exhibited increased activity, suggesting that G4s generally promote AP selection. Furthermore, treatment with G4 ligands induced the generation of APs, suggesting that the stabilization of G4 structures may modulate AP usage. 
Collectively, these findings provide new insights into the G4-based mechanisms that regulate transcript isoform diversity.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf208"},"PeriodicalIF":2.8,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754776/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145890443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
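The filtering criteria quoted in the abstract above (|fold change| > 2, false discovery rate < 0.05, G4 within 100 base pairs) can be made concrete with a small sketch. Everything here is hypothetical: the promoter records, the G4 peak coordinates, the convention that a negative fold change denotes decreased usage, and the measurement of distance from a single promoter coordinate to the nearest peak edge.

```python
# Hypothetical promoter records: (id, chromosome, position, signed fold change, FDR).
promoters = [
    ("p1", "chr1", 1000, 3.2, 0.001),
    ("p2", "chr1", 5000, -2.5, 0.010),
    ("p3", "chr1", 9000, 1.4, 0.001),   # fails the fold-change cutoff
    ("p4", "chr2", 2000, 4.0, 0.200),   # fails the FDR cutoff
    ("p5", "chr2", 7000, -6.0, 0.030),
]

# Hypothetical G4 ChIP-seq peaks as (start, end) intervals per chromosome.
g4_peaks = {
    "chr1": [(1050, 1200), (8000, 8100)],
    "chr2": [(7300, 7400)],
}


def distance_to_peak(position, peak):
    """Distance in bp from a position to an interval (0 if the position is inside)."""
    start, end = peak
    if start <= position <= end:
        return 0
    return min(abs(position - start), abs(position - end))


def is_differential(fold_change, fdr, fc_cutoff=2.0, fdr_cutoff=0.05):
    """Apply the |fold change| and FDR thresholds quoted in the abstract."""
    return abs(fold_change) > fc_cutoff and fdr < fdr_cutoff


def has_nearby_g4(chrom, position, max_distance=100):
    """True if any G4 peak on the chromosome lies within max_distance bp."""
    return any(
        distance_to_peak(position, peak) <= max_distance
        for peak in g4_peaks.get(chrom, [])
    )


differential = [p for p in promoters if is_differential(p[3], p[4])]
g4_associated = [p[0] for p in differential if has_nearby_g4(p[1], p[2])]
fraction = len(g4_associated) / len(differential)

print([p[0] for p in differential], g4_associated, round(fraction, 2))
```

On real data the same logic would typically run through an interval library rather than a nested loop, but the thresholds and the distance rule are the parts that determine which promoters count as G4-associated.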
Pub Date: 2025-12-29; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf191
Moritz Burghardt, Alon Diament, Tamir Tuller
High expression of heterologous proteins is often achieved by integrating multiple copies of a gene into a host. However, such multicopy systems are prone to genetic instability due to homologous recombination between identical sequences. We present the multisequence ChimeraMap (MScMap), an algorithm for designing multiple synonymous coding sequences that minimizes recombination risk while maintaining high expression. MScMap extends the ChimeraMap framework by selecting diverse nucleotide blocks from a host genome to encode the target protein, balancing host adaptation and sequence dissimilarity. We introduce heuristics for block selection and concatenation to reduce long common substrings, a known driver of recombination. Our method outperforms a multi-objective evolutionary algorithm in both genetic stability and predicted expression across a wide range of human proteins while being significantly faster. We also show that MScMap can be used to reduce sequence repeats within a single coding sequence. A web tool for single and multicopy coding sequence optimization is available online.
{"title":"Designing genetically stable multicopy gene constructs with the ChimeraUGEM web server.","authors":"Moritz Burghardt, Alon Diament, Tamir Tuller","doi":"10.1093/nargab/lqaf191","DOIUrl":"10.1093/nargab/lqaf191","url":null,"abstract":"<p><p>High expression of heterologous proteins is often achieved by integrating multiple copies of a gene into a host. However, such multicopy systems are prone to genetic instability due to homologous recombination between identical sequences. We present the multisequence ChimeraMap (MScMap), an algorithm for designing multiple synonymous coding sequences that minimizes recombination risk while maintaining high expression. MScMap extends the ChimeraMap framework by selecting diverse nucleotide blocks from a host genome to encode the target protein, balancing host adaptation and sequence dissimilarity. We introduce heuristics for block selection and concatenation to reduce long common substrings, a known driver of recombination. Our method outperforms a multi-objective evolutionary algorithm in both genetic stability and predicted expression across a wide range of human proteins while being significantly faster. We also show that MScMap can also be used to reduce sequence repeats within a single coding sequence. 
A web tool for single and multicopy coding sequence optimization is available online.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf191"},"PeriodicalIF":2.8,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12746100/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145865374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
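The abstract above names long common substrings between gene copies as a driver of recombination. The sketch below shows how that quantity can be measured for two synonymous coding sequences; it does not reproduce MScMap's block selection. The two toy sequences, the partial codon table (only the codons used here), and the function names are assumptions for illustration.

```python
# Partial codon table covering only the codons used in this toy example.
CODON_TABLE = {
    "ATG": "M", "AAA": "K", "AAG": "K",
    "CTG": "L", "TTA": "L", "GCC": "A", "GCT": "A",
}


def translate(seq):
    """Translate a coding sequence using the partial table above."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1      # extend the match ending at (i-1, j-1)
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


# Two hypothetical synonymous copies encoding the same peptide (MKLLA).
copy_a = "ATGAAACTGCTGGCC"
copy_b = "ATGAAGTTATTAGCT"

shared = longest_common_substring(copy_a, copy_b)
print(translate(copy_a), translate(copy_b), shared, len(shared))
```

Two identical copies would share the full 15 nucleotides, whereas these recoded copies share only a 5-nucleotide stretch. A design algorithm in this spirit would try to keep that number small across all pairs of copies while still favoring codons or blocks that the host expresses well.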
Pub Date: 2025-12-29; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf174
Daniel Gómez-Pérez, Alexander Keller
Understanding microbial phenotypes from genomic data is crucial for studying co-evolution, ecology, and pathology. This study presents a scalable approach that integrates literature-extracted information with genomic data, combining natural language processing and functional genome analysis. We applied this method to publicly available data, providing novel insights into predicting microbial phenotypes. We fine-tuned transformer-based language models to analyze 3.83 million open-access scientific articles, extracting a phenotypic network of bacterial strains. This network maps relationships between strains and traits such as pathogenicity, metabolism, and biome preference. By annotating their reference genomes, we predicted key genes influencing these traits. Our findings align with known phenotypes, reveal novel correlations, and uncover genes involved in disease and host associations. The network's interconnectivity provides deeper understanding of microbial communities and allowed identification of hub species through inferred trophic connections that are difficult to infer experimentally. This work demonstrates the potential of machine learning for uncovering cross-species gene-phenotype patterns. As microbial genomic data and literature expand, such methods will be essential for extracting meaningful insights and advancing microbiology research. In summary, this integrative approach can accelerate discovery and understanding in microbial genomics. Ultimately, such techniques will facilitate the study of microbial ecology, co-evolutionary processes, and disease pathogenesis to an unprecedented depth.
{"title":"Integrating natural language processing and genome analysis enables accurate bacterial phenotype prediction.","authors":"Daniel Gómez-Pérez, Alexander Keller","doi":"10.1093/nargab/lqaf174","DOIUrl":"10.1093/nargab/lqaf174","url":null,"abstract":"<p><p>Understanding microbial phenotypes from genomic data is crucial for studying co-evolution, ecology, and pathology. This study presents a scalable approach that integrates literature-extracted information with genomic data, combining natural language processing and functional genome analysis. We applied this method to publicly available data, providing novel insights into predicting microbial phenotypes. We fine-tuned transformer-based language models to analyze 3.83 million open-access scientific articles, extracting a phenotypic network of bacterial strains. This network maps relationships between strains and traits such as pathogenicity, metabolism, and biome preference. By annotating their reference genomes, we predicted key genes influencing these traits. Our findings align with known phenotypes, reveal novel correlations, and uncover genes involved in disease and host associations. The network's interconnectivity provides deeper understanding of microbial communities and allowed identification of hub species through inferred trophic connections that are difficult to infer experimentally. This work demonstrates the potential of machine learning for uncovering cross-species gene-phenotype patterns. As microbial genomic data and literature expand, such methods will be essential for extracting meaningful insights and advancing microbiology research. In summary, this integrative approach can accelerate discovery and understanding in microbial genomics. 
Ultimately, such techniques will facilitate the study of microbial ecology, co-evolutionary processes, and disease pathogenesis to an unprecedented depth.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf174"},"PeriodicalIF":2.8,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12746109/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145865298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-29; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf192
Marta Moreno-González, Jeroen de Ridder, Jop Kind, Robin H van der Weide
Single-cell profiling of histone post-translational modifications (scHPTMs) offers a powerful lens for dissecting epigenetic regulation and cellular identity, yet low read depth and inherent noise in these datasets pose significant analytical challenges. Here, we introduce the first comprehensive computational framework that systematically evaluates imputation strategies on scHPTM data, including methods originally developed for scRNA-seq and scATAC-seq. Leveraging both synthetic and published datasets, we apply novel performance metrics, implemented in a modular R package, to assess signal recovery, enrichment at biologically relevant genomic sites, and preservation of cell-to-cell similarities. Our extensive benchmarking reveals that performance varies markedly by analytical task (e.g. signal denoising, peak detection, and clustering), highlighting that no one-size-fits-all solution exists for these data. By delineating the strengths and limitations of current imputation approaches, this work lays the foundation for the targeted development of next-generation, task-aware algorithms, while providing critical guidance for researchers and developers on the current capabilities and unmet needs in single-cell epigenomics.
{"title":"A computational framework to dissect imputation strategies for single-cell histone modification data.","authors":"Marta Moreno-González, Jeroen de Ridder, Jop Kind, Robin H van der Weide","doi":"10.1093/nargab/lqaf192","DOIUrl":"10.1093/nargab/lqaf192","url":null,"abstract":"<p><p>Single-cell profiling of histone post-translational modifications (scHPTMs) offers a powerful lens for dissecting epigenetic regulation and cellular identity, yet low read depth and inherent noise in these datasets pose significant analytical challenges. Here, we introduce the first comprehensive computational framework that systematically evaluates imputation strategies on scHPTM data, including methods originally developed for scRNA-seq and scATAC-seq. Leveraging both synthetic and published datasets, we apply novel performance metrics, implemented in a modular R package, to assess signal recovery, enrichment at biologically relevant genomic sites, and preservation of cell-to-cell similarities. Our extensive benchmarking reveals that performance varies markedly by analytical task (e.g. signal denoising, peak detection, and clustering), highlighting that no one-size-fits-all solution exists for these data. 
By delineating the strengths and limitations of current imputation approaches, this work lays the foundation for the targeted development of next-generation, task-aware algorithms, while providing critical guidance for researchers and developers on the current capabilities and unmet needs in single-cell epigenomics.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf192"},"PeriodicalIF":2.8,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12746105/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145865362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-23; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf189
Monireh Safari, Joseph Butler, Gurjit S Randhawa, Kathleen A Hill, Lila Kari
Extreme environments impose strong mutation and selection pressures that drive distinctive, yet understudied, genomic adaptations in extremophiles. In this study, we identify 15 bacterium-archaeon pairs that exhibit highly similar k-mer-based genomic signatures despite maximal taxonomic divergence, suggesting that shared environmental conditions can produce convergent, genome-wide sequence patterns that transcend evolutionary distance. To uncover these patterns, we developed a computational pipeline to select a composite genome proxy assembled from noncontiguous subsequences of the genome. Using supervised machine learning on a curated dataset of 693 extremophile microbial genomes, we found that 6-mers and 100 kbp genome proxy lengths provide the best balance between classification accuracy and computational efficiency. Our results provide conclusive evidence of the pervasive nature of k-mer-based patterns across the genome, and uncover the presence of taxonomic and environmental components that persist across all regions of the genome. The 15 bacterium-archaeon pairs identified by our method as having similar genomic signatures were validated through multiple independent analyses, including 3-mer frequency profile comparisons, phenotypic trait similarity, and geographic co-occurrence data. These complementary validations confirmed that extreme environmental pressures can override traditionally recognized taxonomic components at the whole-genome level. Together, these findings reveal that adaptation to extreme conditions can carry robust, taxonomic domain-spanning imprints on microbial genomes, offering new insight into the relationship between environmental impacts and genome sequence composition convergence.
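To make the idea of a k-mer-based genomic signature in the abstract above concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a signature is the normalized frequency vector of all 4^k k-mers over the alphabet A, C, G, T, pooled across a set of noncontiguous fragments. The function names (`kmer_counts`, `signature`, `cosine_similarity`) are hypothetical, and cosine similarity is an illustrative choice for comparing two signatures; the study itself reports using supervised machine learning.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"


def kmer_counts(sequence, k):
    """Count overlapping k-mers, skipping any window with a non-ACGT symbol."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in ALPHABET for base in kmer):
            counts[kmer] += 1
    return counts


def signature(fragments, k):
    """Pool k-mer counts over fragments and normalize to frequencies.

    Counting per fragment (rather than concatenating first) avoids
    creating artificial k-mers at the fragment junctions.
    The vector is ordered lexicographically: AA..A, AA..C, ..., TT..T.
    """
    total = Counter()
    for fragment in fragments:
        total.update(kmer_counts(fragment, k))
    n = sum(total.values())
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    return [total[km] / n if n else 0.0 for km in kmers]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

With k = 6, as reported in the abstract, each signature has 4^6 = 4096 entries; the fragments passed to `signature` would together make up the genome proxy (for example, about 100 kbp in total), although how those fragments are selected is not specified here and is left to the caller.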
{"title":"Life at the extremes: maximally divergent microbes with similar genomic signatures linked to extreme environments.","authors":"Monireh Safari, Joseph Butler, Gurjit S Randhawa, Kathleen A Hill, Lila Kari","doi":"10.1093/nargab/lqaf189","DOIUrl":"10.1093/nargab/lqaf189","url":null,"abstract":"<p><p>Extreme environments impose strong mutation and selection pressures that drive distinctive, yet understudied, genomic adaptations in extremophiles. In this study, we identify 15 bacterium-archaeon pairs that exhibit highly similar k-mer-based genomic signatures despite maximal taxonomic divergence, suggesting that shared environmental conditions can produce convergent, genome-wide sequence patterns that transcend evolutionary distance. To uncover these patterns, we developed a computational pipeline to select a composite genome proxy assembled from noncontiguous subsequences of the genome. Using supervised machine learning on a curated dataset of 693 extremophile microbial genomes, we found that 6-mers and 100 kbp genome proxy lengths provide the best balance between classification accuracy and computational efficiency. Our results provide conclusive evidence of the pervasive nature of k-mer-based patterns across the genome, and uncover the presence of taxonomic and environmental components that persist across all regions of the genome. The 15 bacterium-archaeon pairs identified by our method as having similar genomic signatures were validated through multiple independent analyses, including 3-mer frequency profile comparisons, phenotypic trait similarity, and geographic co-occurrence data. These complementary validations confirmed that extreme environmental pressures can override traditionally recognized taxonomic components at the whole-genome level. 
Together, these findings reveal that adaptation to extreme conditions can carry robust, taxonomic domain-spanning imprints on microbial genomes, offering new insight into the relationship between environmental impacts and genome sequence composition convergence.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf189"},"PeriodicalIF":2.8,"publicationDate":"2025-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12723239/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145828555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}