Pub Date: 2024-06-06; eCollection Date: 2024-06-01; DOI: 10.1093/nargab/lqae063
Madeline Bellanger, Jose L Figueroa, Lisa Tiemann, Maren L Friesen, Richard Allen White III
Biological nitrogen fixation is a fundamental biogeochemical process that transforms molecular nitrogen into biologically available nitrogen via diazotrophic microbes. Diazotrophs anaerobically fix nitrogen using the nitrogenase enzyme, which is arranged in three different gene clusters: (i) molybdenum nitrogenase (nifHDK), the most abundant, followed by its alternatives, (ii) vanadium nitrogenase (vnfHDK) and (iii) iron nitrogenase (anfHDK). Multiple databases have been constructed as resources for diazotrophic 'omics analysis; however, an integrated database based on whole-genome references does not exist. Here, we present NFixDB (Nitrogen Fixation DataBase), a comprehensive, integrated, whole-genome-based database for diazotrophs, which includes all nitrogenases (nifHDK, vnfHDK, anfHDK) and nitrogenase-like enzymes (e.g. nflHD) linked to ribosomal RNA operons (16S-5S-23S). NFixDB was computed using Hidden Markov Models (HMMs) against the entire whole-genome-based Genome Taxonomy Database (GTDB R214), providing searchable reference HMMs for all nitrogenase and nitrogenase-like genes, complete ribosomal RNA operons, both GTDB and NCBI/RefSeq taxonomy, and an SQL database for querying matches. We compared NFixDB to nifH databases from Buckley, Zehr, Mise and FunGene, finding extensive evidence of nifH, in addition to vnfH and nflH. NFixDB contains >4000 verified nifHDK sequences spanning 50 unique phyla of bacteria and archaea. NFixDB provides the first comprehensive nitrogenase database available to researchers, unlocking diazotrophic microbial potential.
Title: NFixDB (Nitrogen Fixation DataBase) - a comprehensive integrated database for robust 'omics analysis of diazotrophs. Journal: NAR Genomics and Bioinformatics 6(2), lqae063. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11155484/pdf/
Pub Date: 2024-06-06; eCollection Date: 2024-06-01; DOI: 10.1093/nargab/lqae067
Rui Yokomori, Takehiro G Kusakabe, Kenta Nakai
Trans-splicing is a post-transcriptional processing event that joins exons from separate RNAs to produce a chimeric RNA. However, the detailed mechanism of trans-splicing remains poorly understood. Here, we characterize trans-spliced genes and provide insights into the mechanism of trans-splicing in the tunicate Ciona. Tunicates are the closest invertebrate relatives of humans, and their genes frequently undergo trans-splicing. Our analysis revealed that, in genes that give rise to both trans-spliced and non-trans-spliced messenger RNAs, trans-splice acceptor sites were preferentially located at the first functional acceptor site, and their paired donor sites were weak in both Ciona and humans. Additionally, we found that Ciona trans-spliced genes had GU- and AU-rich 5' transcribed regions. Our data and findings are not only useful for the Ciona research community but may also aid in a better understanding of the trans-splicing mechanism, potentially advancing the development of gene therapy based on trans-splicing.
Title: Characterization of trans-spliced chimeric RNAs: insights into the mechanism of trans-splicing. Journal: NAR Genomics and Bioinformatics 6(2), lqae067. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11155486/pdf/
Pub Date: 2024-06-06; eCollection Date: 2024-06-01; DOI: 10.1093/nargab/lqae061
Sudaraka Mallawaarachchi, Gerry Tonkin-Hill, Anna K Pöntinen, Jessica K Calland, Rebecca A Gladstone, Sergio Arredondo-Alonso, Neil MacAlasdair, Harry A Thorpe, Janetta Top, Samuel K Sheppard, David Balding, Nicholas J Croucher, Jukka Corander
Population genomics has revolutionized our ability to study bacterial evolution by enabling data-driven discovery of the genetic architecture of trait variation. Genome-wide association studies (GWAS) have more recently been accompanied by genome-wide epistasis and co-selection (GWES) analysis, which offers a phenotype-free approach to generating hypotheses about selective processes that simultaneously impact multiple loci across the genome. However, existing GWES methods only consider associations between distant pairs of loci within the genome due to the strong impact of linkage disequilibrium (LD) over short distances. Based on the general functional organisation of genomes, it is nevertheless expected that the majority of co-selection and epistasis will act within relatively close genomic proximity, on co-variation occurring within genes and their promoter regions, and within operons. Here, we introduce LDWeaver, which enables an exhaustive GWES across both short- and long-range LD, to disentangle likely neutral co-variation from selection. We demonstrate the ability of LDWeaver to efficiently generate hypotheses about co-selection using large genomic surveys of multiple major human bacterial pathogen species and validate several findings using functional annotation and phenotypic measurements. Our approach will facilitate the study of bacterial evolution in the light of rapidly expanding population genomic data.
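As a rough illustration of the kind of pairwise signal that LD-based analyses start from, the sketch below computes the classical r² measure of linkage disequilibrium between two biallelic loci observed across a set of haploid genomes. It is not LDWeaver's implementation, which this abstract does not describe; the function name `ld_r_squared` and the 0/1 allele encoding are assumptions made for the example.

```python
def ld_r_squared(locus_a, locus_b):
    """Return r^2 between two biallelic loci.

    Each argument is a sequence of 0/1 alleles, one entry per haploid genome
    (hypothetical encoding chosen for this sketch).
    """
    n = len(locus_a)
    if n == 0 or n != len(locus_b):
        raise ValueError("loci must be non-empty and of equal length")

    # Frequencies of allele 1 at each locus and of the 1-1 haplotype.
    p_a = sum(locus_a) / n
    p_b = sum(locus_b) / n
    p_ab = sum(1 for a, b in zip(locus_a, locus_b) if a == 1 and b == 1) / n

    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        # A monomorphic locus carries no LD information.
        return 0.0

    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    return d * d / denominator
```

Two perfectly co-varying loci give r² = 1, and two loci whose alleles combine independently give r² = 0. Separating neutral short-range LD from co-selection requires a background model on top of such pairwise values, which this sketch does not attempt.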
Title: Detecting co-selection through excess linkage disequilibrium in bacterial genomes. Journal: NAR Genomics and Bioinformatics 6(2), lqae061. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11155488/pdf/
Pub Date: 2024-06-04; eCollection Date: 2024-06-01; DOI: 10.1093/nargab/lqae062
Leonid Gorb, Ivan Voiteshenko, Vasyl Hurmach, Margarita Zarudnaya, Alex Nyporko, Tetiana Shyryna, Maksym Platonov, Szczepan Roszak, Bakhtiyor Rasulev
In this computational study, we explore the folding of a particular sequence using various computational tools to produce two-dimensional structures, which are then transformed into three-dimensional structures. We then study the geometry, energetics and dynamics of these structures using full-electron quantum-chemical and classical molecular dynamics calculations. Our study focuses on the SARS-CoV-2 RNA fragment GGaGGaGGuguugcaGG and its various structures, including a G-quadruplex and five different hairpins. We examine the impact of two types of counterions (K+ and Na+) and flanking nucleotides on their geometrical characteristics, relative stability and dynamic properties. Our results show that the G-quadruplex is more stable than any of the constructed hairpins. We confirm its topological stability through molecular dynamics simulations. Furthermore, we observe that the loop consisting of seven nucleotides is the most flexible part of the RNA fragment. Additionally, we find that RNA networks of intermolecular hydrogen bonds are highly sensitive to the surrounding environment. Our findings reveal the loss of 79 old hydrogen bonds and the formation of 91 new ones when the G-quadruplex containing flanking nucleotides is additionally stabilized by Na+ counterions.
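The fragment quoted above can be split into guanine tracts and connecting loops with a few lines of Python. This is only a reading aid, not part of the study's workflow: it assumes that the capital G characters in the quoted fragment mark the tract guanines and that lowercase letters mark loop nucleotides, and the helper name `split_tracts_and_loops` is hypothetical.

```python
import re

FRAGMENT = "GGaGGaGGuguugcaGG"  # fragment as written in the abstract


def split_tracts_and_loops(fragment):
    """Split a sequence into runs of capital G and the stretches between them.

    Assumes (for this sketch only) that capital G marks tract guanines and
    lowercase letters mark loop nucleotides.
    """
    tracts = re.findall(r"G+", fragment)
    # Splitting on the tracts leaves the loops; drop empty edge strings.
    loops = [part for part in re.split(r"G+", fragment) if part]
    return tracts, loops


tracts, loops = split_tracts_and_loops(FRAGMENT)
print(tracts)  # ['GG', 'GG', 'GG', 'GG']
print(loops)   # ['a', 'a', 'uguugca']
```

Under that assumption the fragment has four two-guanine tracts, enough for a two-tetrad quadruplex, and its longest loop has seven nucleotides, matching the seven-nucleotide loop described in the abstract.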
Title: From RNA sequence to its three-dimensional structure: geometrical structure, stability and dynamics of selected fragments of SARS-CoV-2 RNA. Journal: NAR Genomics and Bioinformatics 6(2), lqae062. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11148665/pdf/
Pub Date: 2024-05-30; eCollection Date: 2024-06-01; DOI: 10.1093/nargab/lqae060
Václav Brázda, Lucie Šislerová, Anne Cucchiarini, Jean-Louis Mergny
Current methods of processing archaeological samples, combined with advances in sequencing methods, have led to the disclosure of a large part of H. neanderthalensis and Denisovan genetic information. It is hardly surprising that the genome variability between modern humans, Denisovans and H. neanderthalensis is relatively limited. Genomic studies may provide insight into the metabolism of extinct human species or lineages. Detailed analysis of G-quadruplex sequences in H. neanderthalensis and Denisovan mitochondrial DNA revealed interesting features. The patterns found in mitochondrial DNA are relatively similar to those of modern humans, with one notable exception for H. neanderthalensis. An interesting difference between H. neanderthalensis and H. sapiens corresponds to a motif found in the D-loop region of mtDNA, which is responsible for mitochondrial DNA replication. This area is directly responsible for the number of mitochondria and consequently for the efficient energy metabolism of the cell. H. neanderthalensis harbors a long uninterrupted run of guanines in this region, which may cause problems for replication, in contrast with H. sapiens, for which this run is generally shorter and interrupted. One may propose that the predominant H. sapiens motif provided a selective advantage for modern humans regarding mtDNA replication and function.
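The contrast between a long uninterrupted guanine run and a shorter, interrupted one can be made concrete with a small helper. This is a minimal sketch, assuming plain DNA strings as input; the function name and the toy sequences are hypothetical, and the abstract does not state which tool the authors used for their analysis.

```python
import re


def longest_g_run(sequence):
    """Return the length of the longest uninterrupted run of guanines."""
    runs = re.findall(r"G+", sequence.upper())
    return max((len(run) for run in runs), default=0)


# Toy sequences, not real mtDNA: one continuous run versus an interrupted one.
uninterrupted = "ACCGGGGGGGGGGTA"
interrupted = "ACCGGGGGAGGGGTA"
print(longest_g_run(uninterrupted))  # 10
print(longest_g_run(interrupted))    # 5
```

A single non-G base is enough to halve the longest run in the toy example, which is the kind of difference the abstract describes between the two species.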
Title: G-quadruplex propensity in H. neanderthalensis, H. sapiens and Denisovans mitochondrial genomes. Journal: NAR Genomics and Bioinformatics 6(2), lqae060. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11137754/pdf/
As the accuracy and throughput of long-read sequencing technologies increase, they are rapidly being assimilated into single-cell sequencing pipelines. For transcriptome sequencing, these techniques provide RNA isoform-level information in addition to gene expression profiles. Long-read sequencing technologies not only help in uncovering complex patterns of cell-type-specific splicing, but also offer unprecedented insights into the origin of cellular complexity and thus potentially new avenues for drug development. Additionally, single-cell long-read DNA sequencing enables high-quality assemblies, structural variant detection, haplotype phasing, resolution of high-complexity regions, and characterization of epigenetic modifications. Given that significant progress has primarily occurred in single-cell RNA isoform sequencing (scRiso-seq), this review will delve into these advancements in depth and highlight the practical considerations and operational challenges, particularly pertaining to downstream analysis. We also aim to offer a concise introduction to complementary technologies for single-cell sequencing of the genome, epigenome and epitranscriptome. We conclude by identifying certain key areas of innovation that may drive these technologies further and foster more widespread application in biomedical science.
Title: Advances in single-cell long-read sequencing technologies. Authors: Pallavi Gupta, Hannah O'Neill, Ernst J Wolvetang, Aniruddha Chatterjee, Ishaan Gupta. Pub Date: 2024-05-20; DOI: 10.1093/nargab/lqae047. Journal: NAR Genomics and Bioinformatics 6(2), lqae047. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11106032/pdf/
Pub Date: 2024-05-20; eCollection Date: 2024-06-01; DOI: 10.1093/nargab/lqae052
Zhihao Guo, Ying Ni, Lu Tan, Yanwen Shao, Lianwei Ye, Sheng Chen, Runsheng Li
Nanopore sequencing technologies have enabled the direct detection of base modifications in DNA or RNA molecules. Despite these advancements, the tools for visualizing electrical current, which is essential for analyzing base modifications, often lack clarity and compatibility with diverse nanopore pipelines. Here, we present Nanopore Current Events Magnifier (nanoCEM, https://github.com/lrslab/nanoCEM), a Python command-line tool designed to facilitate the identification of DNA/RNA modification sites through enhanced visualization and statistical analysis. Compatible with four preprocessing methods ('f5c resquiggle', 'f5c eventalign', 'Tombo' and 'move table'), nanoCEM is applicable to RNA and DNA analysis across multiple flow cell types. By utilizing rescaling techniques and calculating various statistical features, nanoCEM provides more accurate and comparable visualization of current events, allowing researchers to effectively observe differences between samples and showcase the modified sites.
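To show why rescaling makes current traces from different samples comparable, the sketch below applies a median and median-absolute-deviation (MAD) normalization, a robust scaling often used for nanopore signals. Whether nanoCEM uses exactly this scheme is not stated in the abstract, so treat the function `rescale_current` and its input format (a list of current values) as assumptions for illustration only.

```python
from statistics import median


def rescale_current(values):
    """Center a current trace on its median and scale it by its MAD.

    Hypothetical helper: the actual rescaling used by nanoCEM may differ.
    """
    if not values:
        return []
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # A flat trace has no spread; only center it.
        return [v - center for v in values]
    return [(v - center) / mad for v in values]


# Two toy traces with the same shape but different offset and gain.
sample_a = [80.0, 85.0, 90.0, 95.0, 100.0]
sample_b = [160.0, 170.0, 180.0, 190.0, 200.0]
print(rescale_current(sample_a))  # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(rescale_current(sample_b))  # [-2.0, -1.0, 0.0, 1.0, 2.0]
```

After rescaling, the two toy traces coincide, so any remaining difference at a given position would reflect the signal itself rather than run-specific offset or gain.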
Title: Nanopore Current Events Magnifier (nanoCEM): a novel tool for visualizing current events at modification sites of nanopore sequencing. Journal: NAR Genomics and Bioinformatics 6(2), lqae052. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11106030/pdf/
Pub Date: 2024-05-20; eCollection Date: 2024-06-01; DOI: 10.1093/nargab/lqae054
Dmitry E Mylarshchikov, Arina I Nikolskaya, Olesja D Bogomaz, Anastasia A Zharikova, Andrey A Mironov
Chromatin-associated non-coding RNAs play important roles in various cellular processes by targeting genomic loci. Two types of genome-wide NGS experiments exist to detect such targets: 'one-to-all', which focuses on targets of a single RNA, and 'all-to-all', which captures targets of all RNAs in a sample. As with many NGS experiments, they are prone to biases and noise, so it becomes essential to detect 'peaks', that is, specific interactions of an RNA with genomic targets. Here, we present BaRDIC (Binomial RNA-DNA Interaction Caller), a tailored method to detect peaks in both types of RNA-DNA interaction data. BaRDIC is the first tool to simultaneously take into account the two most prominent biases in the data: chromatin heterogeneity and distance-dependent decay of interaction frequency. Since RNAs differ in their interaction preferences, BaRDIC adapts peak sizes according to the abundances and contact patterns of individual RNAs. These features enable BaRDIC to make more robust predictions than currently applied peak-calling algorithms and better handle the characteristic sparsity of all-to-all data. The BaRDIC package is freely available at https://github.com/dmitrymyl/BaRDIC.
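The binomial idea behind a peak caller of this kind can be sketched in a few lines: given the total number of contacts of one RNA and a background probability that a contact falls into a given genomic bin, the upper-tail binomial probability tells how surprising the observed count is. This is a simplified illustration, not BaRDIC's code; the function name and the numbers are hypothetical, and the background probability is taken as given here, whereas the method described above derives it from chromatin heterogeneity and distance-dependent decay.

```python
from math import comb


def binomial_upper_tail(observed, total, background_p):
    """Return P(X >= observed) for X ~ Binomial(total, background_p)."""
    if not 0.0 <= background_p <= 1.0:
        raise ValueError("background_p must lie in [0, 1]")
    if total < 0:
        raise ValueError("total must be non-negative")
    start = max(observed, 0)
    return sum(
        comb(total, i) * background_p**i * (1.0 - background_p) ** (total - i)
        for i in range(start, total + 1)
    )


# Hypothetical bin: 12 of an RNA's 200 contacts land in a bin whose
# background share is 2%, so about 4 contacts would be expected by chance.
p_value = binomial_upper_tail(12, 200, 0.02)
print(p_value < 0.01)
```

In a real analysis the test is repeated over many bins and RNAs, so the resulting p-values need multiple-testing correction before bins are reported as peaks.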
Title: BaRDIC: robust peak calling for RNA-DNA interaction data. Journal: NAR Genomics and Bioinformatics 6(2), lqae054. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11106031/pdf/
Pub Date: 2024-04-25; eCollection Date: 2024-06-01; DOI: 10.1093/nargab/lqae031
Friederike Hanssen, Maxime U Garcia, Lasse Folkersen, Anders Sune Pedersen, Francesco Lescai, Susanne Jodoin, Edmund Miller, Matthias Seybold, Oskar Wacker, Nicholas Smith, Gisela Gabernet, Sven Nahnsen
DNA variation analysis has become indispensable in many aspects of modern biomedicine, most prominently in the comparison of normal and tumor samples. Thousands of samples are collected in local sequencing efforts and public databases, requiring highly scalable, portable, and automated workflows for streamlined processing. Here, we present nf-core/sarek 3, a well-established, comprehensive variant calling and annotation pipeline for germline and somatic samples. It is suitable for any genome with a known reference. We present a full rewrite of the original pipeline that significantly reduces storage requirements by using the CRAM format and reduces runtime by increasing intra-sample parallelization. Together, these changes lead to a 70% cost reduction in commercial clouds, enabling users to do large-scale and cross-platform data analysis while keeping costs and CO2 emissions low. The code is available at https://nf-co.re/sarek.
{"title":"Scalable and efficient DNA sequencing analysis on different compute infrastructures aiding variant discovery.","authors":"Friederike Hanssen, Maxime U Garcia, Lasse Folkersen, Anders Sune Pedersen, Francesco Lescai, Susanne Jodoin, Edmund Miller, Matthias Seybold, Oskar Wacker, Nicholas Smith, Gisela Gabernet, Sven Nahnsen","doi":"10.1093/nargab/lqae031","DOIUrl":"10.1093/nargab/lqae031","url":null,"abstract":"<p><p>DNA variation analysis has become indispensable in many aspects of modern biomedicine, most prominently in the comparison of normal and tumor samples. Thousands of samples are collected in local sequencing efforts and public databases, requiring highly scalable, portable, and automated workflows for streamlined processing. Here, we present nf-core/sarek 3, a well-established, comprehensive variant calling and annotation pipeline for germline and somatic samples. It is suitable for any genome with a known reference. We present a full rewrite of the original pipeline that significantly reduces storage requirements by using the CRAM format and runtime by increasing intra-sample parallelization. Together, these changes lead to a 70% cost reduction in commercial clouds, enabling users to run large-scale, cross-platform data analysis while keeping costs and CO<sub>2</sub> emissions low. 
The code is available at https://nf-co.re/sarek.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 2","pages":"lqae031"},"PeriodicalIF":4.0,"publicationDate":"2024-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11044436/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140868473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-04-04 eCollection Date: 2024-06-01 DOI: 10.1093/nargab/lqae029
Nikol Chantzi, Manvita Mareboina, Maxwell A Konnaris, Austin Montgomery, Michail Patsakis, Ioannis Mouratidis, Ilias Georgakopoulos-Soares
The prevalence of nucleic and peptide short sequences across organismal genomes and proteomes has not been thoroughly investigated. We examined 45 785 reference genomes and 21 871 reference proteomes, spanning archaea, bacteria, eukaryotes and viruses to calculate the rarity of short sequences in them. To capture this, we developed a metric of the rarity of each sequence in nature, the rarity index. We find that the frequency of certain dipeptides in rare oligopeptide sequences is hundreds of times lower than expected, which is not the case for any dinucleotides. We also generate predictive regression models that infer the rarity of nucleic and proteomic sequences across nature or within each domain of life and viruses separately. When examining each of the three domains of life and viruses separately, the R² performance of the model predicting rarity for 5-mer peptides from mono- and dipeptides ranged between 0.814 and 0.932. A separate model predicting rarity for 10-mer oligonucleotides from mono- and dinucleotides achieved R² performance between 0.408 and 0.606. Our results indicate that the mono- and dinucleotide composition of nucleic sequences and the mono- and dipeptide composition of peptide sequences can explain a significant proportion of the variance in their frequencies in nature.
{"title":"The determinants of the rarity of nucleic and peptide short sequences in nature.","authors":"Nikol Chantzi, Manvita Mareboina, Maxwell A Konnaris, Austin Montgomery, Michail Patsakis, Ioannis Mouratidis, Ilias Georgakopoulos-Soares","doi":"10.1093/nargab/lqae029","DOIUrl":"10.1093/nargab/lqae029","url":null,"abstract":"<p><p>The prevalence of nucleic and peptide short sequences across organismal genomes and proteomes has not been thoroughly investigated. We examined 45 785 reference genomes and 21 871 reference proteomes, spanning archaea, bacteria, eukaryotes and viruses to calculate the rarity of short sequences in them. To capture this, we developed a metric of the rarity of each sequence in nature, the rarity index. We find that the frequency of certain dipeptides in rare oligopeptide sequences is hundreds of times lower than expected, which is not the case for any dinucleotides. We also generate predictive regression models that infer the rarity of nucleic and proteomic sequences across nature or within each domain of life and viruses separately. When examining each of the three domains of life and viruses separately, the <i>R</i>² performance of the model predicting rarity for 5-mer peptides from mono- and dipeptides ranged between 0.814 and 0.932. A separate model predicting rarity for 10-mer oligonucleotides from mono- and dinucleotides achieved <i>R</i>² performance between 0.408 and 0.606. 
Our results indicate that the mono- and dinucleotide composition of nucleic sequences and the mono- and dipeptide composition of peptide sequences can explain a significant proportion of the variance in their frequencies in nature.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 2","pages":"lqae029"},"PeriodicalIF":5.4,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10993293/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140872441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
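The comparison of observed and expected dipeptide (or dinucleotide) frequencies mentioned in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the expected frequency is taken from an independence model built on single-residue frequencies, which may differ from the authors' definition, and the paper's rarity index is not reproduced here. The function names `kmer_freqs` and `observed_over_expected` are hypothetical.

```python
from collections import Counter


def kmer_freqs(seqs, k):
    """Relative frequency of every k-mer observed in a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def observed_over_expected(seqs, k=2):
    """Ratio of observed k-mer frequency to the product of its monomer frequencies.

    Assumption of this sketch: residues are independent under the null model.
    Only k-mers that occur at least once are reported; absent k-mers have ratio 0.
    """
    mono = kmer_freqs(seqs, 1)
    ratios = {}
    for kmer, freq in kmer_freqs(seqs, k).items():
        expected = 1.0
        for residue in kmer:
            expected *= mono[residue]
        ratios[kmer] = freq / expected
    return ratios


if __name__ == "__main__":
    # Toy input; real analyses would stream reference genomes or proteomes.
    print(observed_over_expected(["ABAB"], k=2))
```

A ratio far below 1 marks a k-mer that is rarer than its composition alone would predict, which is the kind of depletion the abstract reports for certain dipeptides.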