Pub Date: 2025-12-17; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf183
Mathias Witte Paz, Alina Bitzer, Kay Nieselt, Simon Heilbronner
Metallophores are secondary metabolites that enable bacterial growth in metal-limited environments such as the human nasal microbiome. While synthesis and uptake of metallophores in Staphylococcus aureus are well characterized, the diversity across the Staphylococcus genus remains unclear. We performed a comprehensive bioinformatic analysis of 77 representative species, as well as over 1800 strains, to map metallophore biosynthetic gene clusters (BGCs) and uptake systems. Staphyloferrin A (SF-A) biosynthesis was widely conserved, though disrupted loci were found in some species, with some of them appearing to have replaced SF-A with a newly discovered, still uncharacterized, BGC. In contrast, staphyloferrin B and staphylopine production were restricted to select species. Uptake systems were more broadly distributed, showing evidence of "cheating" species that lack biosynthesis, but retain the required lipoproteins for metallophore usage. Staphylococcus lugdunensis exemplifies this, encoding multiple uptake systems without producing known metallophores. Strain-level variation was also observed, particularly with specific cases of SF-A truncation, but also for the diversity of lipoprotein receptors. These findings highlight the diversity of metallophore systems, suggesting diverse metallophore-dependent cooperation and competition within the Staphylococcus genus. This work provides a foundation for future experimental studies to identify the role of metallophores in microbial community interactions.
{"title":"A landscape of metallophore synthesis and uptake potential of the genus Staphylococcus.","authors":"Mathias Witte Paz, Alina Bitzer, Kay Nieselt, Simon Heilbronner","doi":"10.1093/nargab/lqaf183","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf183"},"PeriodicalIF":2.8,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12709192/pdf/","platform":"Semanticscholar","paperid":"145782945","PMCID":"OA"}
Pub Date: 2025-12-17; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf162
Marc Kealhofer, Ruth Brown, Brien P Riley, Tan-Hoang Nguyen
Rare exonic variant studies have previously implicated overlapping risk genes and pathways for autism spectrum disorder (ASD), severe, undiagnosed developmental disorders (UDDs), intellectual disability (ID), congenital heart disease (CHD), and schizophrenia (SCZ). Here, we use a two-trait Bayesian integrative analysis approach on 43 287 ASD, UDD/ID, CHD, and SCZ case trios to increase statistical power for gene discovery and to identify shared risk genes. At a posterior probability > 0.80, we identified 180 candidate risk genes for ASD, 315 for UDD/ID, 49 for CHD, and 47 for SCZ, including genes not previously reported, and also detected shared risk genes in pair-wise analyses. Gene set enrichment analysis of the ASD-UDD/ID, ASD-SCZ, and UDD/ID-SCZ shared risk genes overwhelmingly implicated gene sets associated with the synapse and epigenetic modification, while CHD-ASD shared risk genes were enriched in cell cycle phase transition gene sets, and CHD-UDD/ID shared risk genes implicated cardiac development. ASD-UDD/ID risk genes had elevated expression in interneurons and pyramidal cells, while ASD-UDD/ID and CHD-UDD/ID shared risk genes showed elevated connectivity in protein-protein interaction networks. Leveraging information across disorders with genetic overlap can both increase power for candidate risk gene discovery and help elucidate shared genetic mechanisms.
{"title":"Joint analysis of de novo mutations from autism spectrum disorder, schizophrenia, congenital heart disease, and other developmental disorders improves detection power and implicates shared molecular pathways and CNS processes.","authors":"Marc Kealhofer, Ruth Brown, Brien P Riley, Tan-Hoang Nguyen","doi":"10.1093/nargab/lqaf162","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf162"},"PeriodicalIF":2.8,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12709184/pdf/","platform":"Semanticscholar","paperid":"145783037","PMCID":"OA"}
The recent expansion of prokaryotic genomes reveals many ortholog groups (OGs) whose function cannot be inferred from conventional, sequence similarity-based annotation methods, especially in metagenome-assembled genomes. Phylogenetic profiling is one of the promising methods to annotate these OGs: it identifies functional relationships between OGs from the co- or anti-occurrence of their distributions across genomes rather than from sequence similarity. Here, we propose two new phylogenetic methods for large-scale data, Ancestral State Adjustment (ASA) and Simultaneous EVolution test (SEV), which consider the ancestral state of OG presence/absence. In evaluations using three distinct prokaryotic datasets, ASA and SEV performed better than or comparably to both established and recently proposed methods for large-scale data. We compared the functionally related OGs detected by each method and found that SEV and its predecessor can identify slowly evolving OGs, such as housekeeping genes. In contrast, ASA and its predecessors can detect functionally related OGs that tend to be gained or lost in a fixed order, indicating a strong evolutionary constraint that provides clues for functional prediction. Using matrix multiplication, we also showed that SEV scales to the latest genome databases.
{"title":"CORGIAS: identifying correlated gene pairs by considering evolutionary history in a large-scale prokaryotic genome dataset.","authors":"Yuki Nishimura, Kimiho Omae, Kento Tominaga, Wataru Iwasaki","doi":"10.1093/nargab/lqaf182","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf182"},"PeriodicalIF":2.8,"publicationDate":"2025-12-12","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12699329/pdf/","platform":"Semanticscholar","paperid":"145757792","PMCID":"OA"}
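The co-occurrence idea behind phylogenetic profiling can be illustrated with a small sketch. It is a naive baseline only: it ignores the phylogeny and ancestral states entirely, so it is not ASA or SEV. The function names and the toy presence/absence profiles are hypothetical. The pairwise count is written as the product of the presence/absence matrix with its transpose, which is the operation that makes this kind of comparison amenable to matrix multiplication.

```python
def cooccurrence(profiles):
    """profiles: one row per ortholog group (OG), one 0/1 entry per genome.

    Returns M with M[i][j] = number of genomes that contain both OG i and
    OG j, i.e. the matrix product P * P^T written out in pure Python.
    """
    n = len(profiles)
    return [[sum(a * b for a, b in zip(profiles[i], profiles[j]))
             for j in range(n)] for i in range(n)]


def jaccard(profiles):
    """Jaccard similarity of OG distributions, derived from the counts above."""
    co = cooccurrence(profiles)
    n = len(profiles)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # |A or B| = |A| + |B| - |A and B|; the diagonal holds |A| and |B|.
            union = co[i][i] + co[j][j] - co[i][j]
            out[i][j] = co[i][j] / union if union else 0.0
    return out


# Hypothetical toy data: three OGs across five genomes.
profiles = [
    [1, 1, 0, 0, 1],  # OG_a
    [1, 1, 0, 0, 1],  # OG_b: same distribution as OG_a (co-occurrence)
    [0, 0, 1, 1, 0],  # OG_c: complementary to OG_a (anti-occurrence)
]
similarity = jaccard(profiles)
```

Here `similarity[0][1]` is high for the co-occurring pair and `similarity[0][2]` is zero for the anti-occurring pair. Methods that account for evolutionary history, such as those described in the abstract, exist because raw counts like these treat closely related genomes as independent observations.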
Pub Date: 2025-12-10; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf161
Hanah Robertson, Hoang T T Do, Volkhard Helms
Exonic enrichment of histone marks hints at their role in regulating alternative splicing. This study aims to connect the transcriptome and epigenome in the context of splicing outcomes in embryonic cell lines. The tools rMATS and MANorm were used to obtain estimates of differential inclusion of exons and differential enrichment of epigenetic signals, respectively. Two classes of alternative exons were identified in embryonic cell lines: those differentially co-occurring with at least one mark among H3K27ac, H3K27me3, H3K36me3, H3K9me3, and H3K4me3, and those marked by none of these marks. Binary classifiers were trained using RNA-binding protein (RBP) binding affinities on the flanking regions of these exons. This resulted in a set of RBPs whose putative binding was predicted to associate the local chromatin modification marking an exon with its differential inclusion; some of these RBPs have been experimentally shown to interact with histone mark reader proteins. We speculate that sequence signals harbored at exon-intron flanks regulate differential splicing of exons marked by at least one of the five epigenetic signatures. Finally, eCLIP data from ENCODE for the HepG2 and K562 cell lines support TIA1 and U2AF2 as potential episplicing RBPs, as predicted by our model in the embryonic cell lines.
{"title":"RNA-binding proteins connect Exon usage to the chromatin.","authors":"Hanah Robertson, Hoang T T Do, Volkhard Helms","doi":"10.1093/nargab/lqaf161","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf161"},"PeriodicalIF":2.8,"publicationDate":"2025-12-10","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12693533/pdf/","platform":"Semanticscholar","paperid":"145744686","PMCID":"OA"}
Pub Date: 2025-12-10; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf165
Andre Jatmiko Wijaya, Aleksandar Anžel, Hugues Richard, Georges Hattab
Horizontal gene transfer (HGT) accelerates the spread of antimicrobial resistance (AMR) via mobile genetic elements allowing pathogens to acquire resistance genes across species. This process drives the evolution of multidrug-resistant "superbugs" in clinical settings. Detection of HGT is critical to mitigating AMR, but traditional methods based on sequence assembly or comparative genomics lack resolution for complex transfer events. While machine learning (ML) promises improved detection, several studies in other domains have demonstrated that data representations will strongly influence its performance. There is, however, no clear recommendation on the best data representation for HGT detection. Here, we evaluated 44 genomic data representations using five ML models across four data sets. We demonstrate that ML performance is highly dependent on the genomic data representation. The RCKmer-based representation (k = 7) paired with a support vector machine is found to be optimal (F1: 0.959; MCC: 0.908), outperforming other approaches. Moreover, models trained on multi-species data sets are shown to generalize better. Our findings suggest that genomic surveillance benefits from task-specific genome data representations. This work provides state-of-the-art, fine-tuned models for identifying and annotating genomic islands that will enable proper detection of transfer of AMR-related genes between species.
{"title":"Genomic data representations for horizontal gene transfer detection.","authors":"Andre Jatmiko Wijaya, Aleksandar Anžel, Hugues Richard, Georges Hattab","doi":"10.1093/nargab/lqaf165","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf165"},"PeriodicalIF":2.8,"publicationDate":"2025-12-10","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12693543/pdf/","platform":"Semanticscholar","paperid":"145744729","PMCID":"OA"}
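As a rough illustration of a reverse-complement k-mer representation, the sketch below counts k-mers after merging each k-mer with its reverse complement, so that a sequence and its opposite strand give the same feature vector. This is one common reading of "RCKmer"; the exact definition and normalization used in the study are assumptions here, and the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Representative of a k-mer and its reverse complement (lexicographic min)."""
    return min(kmer, revcomp(kmer))


def rc_kmer_features(seq, k=7):
    """Return (sorted canonical k-mers, normalized frequency vector).

    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
            total += 1
    return keys, [counts[key] / total if total else 0.0 for key in keys]


# Hypothetical toy sequence; k=3 keeps the example small.
keys, vector = rc_kmer_features("ACGTTGCAAC", k=3)
```

With k = 7 the vector has 8192 entries (half of 4^7, because no odd-length k-mer equals its own reverse complement). A vector like this could then be passed to any classifier; the support vector machine named in the abstract is one such choice.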
Pub Date: 2025-12-10; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf153
Sarwar Azam, Abhisek Sahu, Mohammad Kadivella, Aamir Waseem Khan, Mahesh Neupane, Curtis P Van Tassell, Benjamin D Rosen, Ravi Kumar Gandham, Subha Narayan Rath, Subeer S Majumdar
India, home to the world's largest cattle population, hosts native dairy breeds essential to its agricultural economy because of their adaptability and resilience. This study characterizes the genomes of five prominent breeds (Gir, Kankrej, Red Sindhi, Sahiwal, and Tharparkar), highlighting their unique genomic characteristics. The de novo assemblies ranged from 2.70 to 2.78 Gb in size, with 90% of each genome assembled in just 56 to 1663 scaffolds. Reference-guided scaffolding further improved these assemblies, resulting in 93.3%-96.7% pseudomolecule coverage with strong BUSCO scores (94.1%-95.5%). Comparative analyses revealed 87%-95% synteny with the Brahman genome and identified 19.84-153.16 Mb of structural rearrangements per genome, including inversions, translocations, and duplications. Synteny diversity analysis uncovered 10 643 perfectly collinear regions spanning 87.3 Mb and 6622 hotspots of rearrangement (HOT regions) covering 55.18 Mb. These HOT regions, characterized by high synteny diversity, were significantly enriched with immune-related genes. Moreover, immune-related gene clusters, including the major histocompatibility complex, natural killer complex, and leukocyte receptor complex, were identified within HOT regions in the desi reference genome. Our findings provide valuable insights into the genetic diversity of desi cattle breeds. The high-quality genome assemblies generated in this study will serve as valuable resources for future research in genetic improvement, disease resistance, and environmental adaptation.
{"title":"Genome assemblies of Indian desi cattle reveal hotspots of rearrangements and immune-related genetic diversity.","authors":"Sarwar Azam, Abhisek Sahu, Mohammad Kadivella, Aamir Waseem Khan, Mahesh Neupane, Curtis P Van Tassell, Benjamin D Rosen, Ravi Kumar Gandham, Subha Narayan Rath, Subeer S Majumdar","doi":"10.1093/nargab/lqaf153","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf153"},"PeriodicalIF":2.8,"publicationDate":"2025-12-10","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12693530/pdf/","platform":"Semanticscholar","paperid":"145744401","PMCID":"OA"}
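The statement that 90% of a genome is assembled in a given number of scaffolds corresponds to what is usually called L90, with N90 being the length of the shortest scaffold in that set. The sketch below shows how such a statistic is typically computed; the function name and the toy scaffold lengths are hypothetical and do not come from the study.

```python
def lx_nx(lengths, percent=90):
    """Return (Lx, Nx) for a list of scaffold lengths.

    Lx: smallest number of scaffolds, taken longest first, whose combined
        length reaches `percent` percent of the total assembly length.
    Nx: length of the shortest scaffold in that set.
    Integer arithmetic avoids floating-point edge cases at the threshold.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running * 100 >= percent * total:
            return count, length
    return 0, 0  # empty input


# Hypothetical toy assembly of five scaffolds (arbitrary units).
scaffolds = [50, 30, 10, 5, 5]
l90, n90 = lx_nx(scaffolds, percent=90)
```

For the toy input the three longest scaffolds (50 + 30 + 10) reach 90% of the total, so `l90` is 3 and `n90` is 10. A low L90 relative to the chromosome count indicates a highly contiguous assembly.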
Pub Date: 2025-12-10; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf186
Matilde Manetti, Samuel Hiet, Myriam Rahmouni, Jean-Louis Spadoni, Alice Dobiecki, Marco Lamanda, Maxime Tison, Taoufik Labib, Cristina Giuliani, Sigrid Le Clerc, Jean-François Deleuze, Jean-François Zagury
Haplotype blocks in the genome are informative of evolutionary processes, and they play a pivotal role in describing genomic variability across human populations and susceptibility or resistance to diseases. Several software tools have been developed for haplotype block detection, but they do not distinguish between the impacts of major and minor single nucleotide polymorphism (SNP) alleles. In this study, we present HaploExplore, a powerful haploblock detection software specifically designed to identify haploblocks associated with SNP minor alleles (MiA-haploblocks). These haploblocks are particularly important as they can significantly influence phenotypic traits, offering a novel approach for studying genetic associations and complex traits. HaploExplore operates on VCF files containing phased data, exhibits rapid processing times, and generates user-friendly outputs. Results converge when analyzing populations of 100 individuals or more. A comparative analysis of HaploExplore against other haploblock detection software revealed its superiority in terms of either simplicity, flexibility, or speed, with the unique capability to target minor alleles. HaploExplore will be very useful for evolutionary genomics and for GWAS analysis in human diseases, given that the effects of genetic associations may accumulate within a specific haploblock.
{"title":"HaploExplore, a software specifically designed for the detection of minor allele (MiA-) haploblocks.","authors":"Matilde Manetti, Samuel Hiet, Myriam Rahmouni, Jean-Louis Spadoni, Alice Dobiecki, Marco Lamanda, Maxime Tison, Taoufik Labib, Cristina Giuliani, Sigrid Le Clerc, Jean-François Deleuze, Jean-François Zagury","doi":"10.1093/nargab/lqaf186","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf186"},"PeriodicalIF":2.8,"publicationDate":"2025-12-10","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12693498/pdf/","platform":"Semanticscholar","paperid":"145744731","PMCID":"OA"}
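The abstract does not describe how HaploExplore groups SNPs, so the following is only a generic sketch of the underlying quantity: linkage disequilibrium between two phased SNPs after recoding each so that 1 denotes its minor allele. A positive D then means the two minor alleles tend to sit on the same haplotypes. The function names, the tie-breaking rule, and the toy haplotypes are all assumptions.

```python
def minor_allele(column):
    """Return the rarer allele (0 or 1) in a list of phased alleles.

    Ties are resolved in favour of allele 1 (an arbitrary choice).
    """
    ones = sum(column)
    return 1 if ones * 2 <= len(column) else 0


def minor_allele_ld(snp_a, snp_b):
    """Return (D, r2) between two SNPs, both recoded so that 1 = minor allele.

    snp_a, snp_b: equal-length lists of 0/1 alleles, one entry per haplotype.
    """
    n = len(snp_a)
    ma, mb = minor_allele(snp_a), minor_allele(snp_b)
    a = [1 if x == ma else 0 for x in snp_a]
    b = [1 if x == mb else 0 for x in snp_b]
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    d = pab - pa * pb
    denom = pa * (1 - pa) * pb * (1 - pb)
    r2 = d * d / denom if denom else 0.0
    return d, r2


# Hypothetical phased data: 8 haplotypes, two SNPs with opposite 0/1 coding
# whose minor alleles are carried by the same two haplotypes.
snp_a = [0, 0, 0, 0, 0, 0, 1, 1]
snp_b = [1, 1, 1, 1, 1, 1, 0, 0]
d, r2 = minor_allele_ld(snp_a, snp_b)
```

In this toy case the minor alleles are perfectly coupled, so `r2` is 1 and `d` is positive even though the raw 0/1 codes are opposite. This is why recoding by minor allele matters when the direction of association is of interest.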
Pub Date: 2025-12-10; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf170
Joshua Rubin, Jan van Waaij, Louis Kraft, Jouni Sirén, Peter Wad Sackett, Gabriel Renaud
Aligning DNA sequences retrieved from fossils or other paleontological artifacts, referred to as ancient DNA (aDNA), is particularly challenging because of the short sequence length and chemical damage, which creates a specific substitution pattern (C→T and G→A), in addition to the heightened divergence between the sample and the reference genome, all of which exacerbate reference bias. This bias can be mitigated by aligning to pangenome graphs to incorporate documented organismic variation, but this approach still suffers from substitution patterns due to chemical damage. We introduce a novel methodology based on the RYmer index, a variant of the commonly used minimizer index which represents purines (A,G) and pyrimidines (C,T) as R and Y, respectively. This creates an indexing scheme robust to the aforementioned chemical damage. We implemented SAFARI (Sensitive Alignments From A RYmer Index), an aDNA damage-aware version of the pangenome aligner vg giraffe, which uses RYmers to rescue alignments containing deaminated seeds. For highly damaged samples, the recovery rate could be upwards of 10%, an amount which could well affect downstream results. We show that our approach produces more correct alignments from aDNA sequences than current approaches while maintaining a tolerable rate of spurious alignments. In addition, we demonstrate that our algorithm improves the estimate of the rate of aDNA damage, especially for highly damaged samples. Crucially, we show that this improved alignment can directly translate into better insights gained from the data by showcasing its integration with a number of extant pangenome tools.
{"title":"SAFARI: pangenome alignment of ancient DNA using purine/pyrimidine encodings.","authors":"Joshua Rubin, Jan van Waaij, Louis Kraft, Jouni Sirén, Peter Wad Sackett, Gabriel Renaud","doi":"10.1093/nargab/lqaf170","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf170"},"PeriodicalIF":2.8,"publicationDate":"2025-12-10","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12693626/pdf/","platform":"Semanticscholar","paperid":"145744782","PMCID":"OA"}
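The purine/pyrimidine encoding described in the abstract can be shown in a few lines. The sketch covers only the encoding step and why it tolerates deamination; it is not the SAFARI or vg giraffe implementation, and the function name and toy sequences are hypothetical.

```python
# A and G are purines (R); C and T are pyrimidines (Y).
RY_TABLE = str.maketrans("AGCT", "RRYY")


def ry_encode(seq):
    """Collapse a DNA string to the two-letter purine/pyrimidine alphabet."""
    return seq.upper().translate(RY_TABLE)


# Hypothetical seed and two damaged copies of it.
reference = "ACGTCCGA"
c_to_t = "ATGTCTGA"       # two C->T substitutions (positions 1 and 5)
g_to_a = "ACATCCAA"       # two G->A substitutions (positions 2 and 6)
transversion = "AAGTCCGA"  # one C->A substitution (position 1), not damage-like

same_after_c_to_t = ry_encode(reference) == ry_encode(c_to_t)
same_after_g_to_a = ry_encode(reference) == ry_encode(g_to_a)
same_after_transversion = ry_encode(reference) == ry_encode(transversion)
```

C→T and G→A both stay within a class (pyrimidine to pyrimidine, purine to purine), so the encoded seed is unchanged and can still match the index, whereas a transversion changes it. The trade-off is that a two-letter alphabet makes seeds less specific, which fits the abstract's remark about keeping spurious alignments at a tolerable rate.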
Pub Date: 2025-12-10 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf175
Narges Zali, Osama El Demerdash, Kapeel Chougule, Zhenyuan Lu, Doreen Ware, Bruce Stillman
Yeast is commonly utilized in molecular and cell biology research, and Yarrowia lipolytica is favored by bioengineers due to its ability to produce copious amounts of lipids, chemicals, and enzymes for industrial applications. Y. lipolytica is a dimorphic yeast that can proliferate in aerobic and hydrophobic environments conducive to industrial use. However, there is limited knowledge about the basic molecular biology of this yeast, including how the genome is duplicated and how gene silencing occurs. Genome sequences of Y. lipolytica strains have offered insights into this yeast species and have facilitated the development of new industrial applications. Although previous studies have reported the genome sequence of a few Y. lipolytica strains, it is of value to have more precise sequences and annotation, particularly for studies of the biology of this yeast. To further study and characterize the molecular biology of this microorganism, a high-quality reference genome assembly and annotation has been produced for two related Y. lipolytica strains of the opposite mating type, strain E122 (MATA) and 22301-5 (MATB). The combination of short-read and long-read sequencing of genome DNA and short-read and long-read sequencing of transcript cDNAs allowed the genome assembly and a comparison with a distantly related Yarrowia strain.
{"title":"Genome sequence assembly and annotation of <i>MATA</i> and <i>MATB</i> strains of <i>Yarrowia lipolytica</i>.","authors":"Narges Zali, Osama El Demerdash, Kapeel Chougule, Zhenyuan Lu, Doreen Ware, Bruce Stillman","doi":"10.1093/nargab/lqaf175","DOIUrl":"10.1093/nargab/lqaf175","url":null,"abstract":"<p><p>Yeast is commonly utilized in molecular and cell biology research, and <i>Yarrowia lipolytica</i> is favored by bioengineers due to its ability to produce copious amounts of lipids, chemicals, and enzymes for industrial applications. <i>Y. lipolytica</i> is a dimorphic yeast that can proliferate in aerobic and hydrophobic environments conducive to industrial use. However, there is limited knowledge about the basic molecular biology of this yeast, including how the genome is duplicated and how gene silencing occurs. Genome sequences of <i>Y. lipolytica</i> strains have offered insights into this yeast species and have facilitated the development of new industrial applications. Although previous studies have reported the genome sequence of a few <i>Y. lipolytica</i> strains, it is of value to have more precise sequences and annotation, particularly for studies of the biology of this yeast. To further study and characterize the molecular biology of this microorganism, a high-quality reference genome assembly and annotation has been produced for two related <i>Y. lipolytica</i> strains of the opposite mating type, strain E122 (<i>MATA</i>) and 22301-5 (<i>MATB</i>). 
The combination of short-read and long-read sequencing of genome DNA and short-read and long-read sequencing of transcript cDNAs allowed the genome assembly and a comparison with a distantly related <i>Yarrowia</i> strain.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf175"},"PeriodicalIF":2.8,"publicationDate":"2025-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12693490/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145744355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-08 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf179
Sofía Prieto León, Ewoud De Troyer, Helena Geys, Koen Van den Berge, Olivier Thas
Removing unwanted variation (RUV) is key for accurate biological interpretation in high-throughput sequencing studies. However, no standardized approach exists for pseudobulked single-cell RNA-sequencing (scRNA-seq) data. Improper implementation of RUV methods may remove biological information, jeopardizing power and false positive control in differential expression analysis. We evaluate the impact of three implementation strategies ('trails') in three RUV methods (RUV2, RUVIII, RUV4) using simulated and real biological signals in pseudobulked scRNA-seq data. Effects of technical noise under confounding and model misspecification conditions are also considered. Additionally, we introduce a novel strategy, RUVIII PBPS, to remove unwanted variation in pseudobulk differential expression analyses with insufficient technical replicates or negative control genes. Our analysis demonstrates that removing unwanted variation per cell type with RUV2 or RUVIII extracts factors associated with technical noise and controls the false discovery rate (FDR), even in the presence of confounding. RUVIII PBPS successfully controls the FDR when other standard RUV methods cannot be used due to missing technical replicates, dependence between the factor of interest and the sources of unwanted variation, and lack of plausible negative control genes.
{"title":"Removal of unwanted variation in pseudobulk analysis of single-cell RNA sequencing data and the leveraging of pseudoreplicates.","authors":"Sofía Prieto León, Ewoud De Troyer, Helena Geys, Koen Van den Berge, Olivier Thas","doi":"10.1093/nargab/lqaf179","DOIUrl":"10.1093/nargab/lqaf179","url":null,"abstract":"<p><p>Removing unwanted variation (RUV) is key for accurate biological interpretation in high-throughput sequencing studies. However, no standardized approach exists for pseudobulked single-cell RNA-sequencing (scRNA-seq) data. Improper implementation of RUV methods may remove biological information, jeopardizing power and false positive control in differential expression analysis. We evaluate the impact of three implementation strategies ('trails') in three RUV methods (RUV2, RUVIII, RUV4) using simulated and real biological signals in pseudobulked scRNA-seq data. Effects of technical noise under confounding and model misspecification conditions are also considered. Additionally, we introduce a novel strategy, RUVIII PBPS, to remove unwanted variation in pseudobulk differential expression analyses with insufficient technical replicates or negative control genes. Our analysis demonstrates that removing unwanted variation per cell type with RUV2 or RUVIII extracts factors associated with technical noise and controls the false discovery rate (FDR), even in the presence of confounding. 
RUVIII PBPS successfully controls the FDR when other standard RUV methods cannot be used due to missing technical replicates, dependence between the factor of interest and the sources of unwanted variation, and lack of plausible negative control genes.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf179"},"PeriodicalIF":2.8,"publicationDate":"2025-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12684399/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145715626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
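For readers unfamiliar with pseudobulking, the data preparation that precedes a per-cell-type RUV analysis can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `pseudobulk` and `split_by_cell_type` and the toy counts are hypothetical, and the RUV step itself (RUV2, RUVIII, RUV4, or RUVIII PBPS) is deliberately not implemented here.

```python
# Hypothetical sketch: aggregate single-cell counts into pseudobulk profiles,
# then separate them by cell type so each cell type can be analysed on its own.


def pseudobulk(counts, samples, cell_types):
    """Sum per-cell gene count vectors within each (sample, cell type) group."""
    totals = {}
    for row, sample, cell_type in zip(counts, samples, cell_types):
        key = (sample, cell_type)
        if key not in totals:
            totals[key] = [0] * len(row)
        totals[key] = [a + b for a, b in zip(totals[key], row)]
    return totals


def split_by_cell_type(profiles):
    """Regroup pseudobulk profiles as {cell_type: {sample: gene_counts}}."""
    by_type = {}
    for (sample, cell_type), vector in profiles.items():
        by_type.setdefault(cell_type, {})[sample] = vector
    return by_type


# Toy data: 4 cells x 3 genes, from two samples and two cell types.
counts = [
    [1, 0, 3],
    [2, 1, 0],
    [0, 4, 1],
    [5, 0, 2],
]
samples = ["s1", "s1", "s1", "s2"]
cell_types = ["T", "T", "B", "T"]

profiles = pseudobulk(counts, samples, cell_types)
per_type = split_by_cell_type(profiles)

# Each entry of per_type is a samples-by-genes table for a single cell type;
# an unwanted-variation method would then be applied to each table separately.
print(per_type["T"])   # {'s1': [3, 1, 3], 's2': [5, 0, 2]}
```

Keeping one table per cell type mirrors the strategy the abstract reports as controlling the FDR, in which unwanted variation is removed per cell type.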