Pub Date: 2025-08-05; eCollection Date: 2025-09-01; DOI: 10.1093/nargab/lqaf107
Chiara Parrillo, Michele Kulesko, Federica Persiani, Lorenzo De Marco, Paolo Petescia, Luca Mastrantoni, Camilla Nero, Angelo Minucci, Luciano Giacò
cBioPortal is a widely used platform for exploring and visualizing multidimensional cancer data. Users can also upload their own cancer studies. However, the upload step can be challenging because the platform requires numerous files and because the genomic alterations to be included in the study need meticulous review. There is therefore a growing need for efficient data management solutions that facilitate study creation in cBioPortal and streamline research workflows. In this application note, we present Varan, a data management tool that helps users at the initial stage of uploading cancer genomic studies by supporting data preparation. Varan addresses data formatting, annotation-based variant filtering, metadata file creation, quality checks, and study versioning, enabling researchers to shorten preparation time and to control which data are uploaded to cBioPortal. In conclusion, Varan improves the efficiency and accuracy of preparing cancer genomic studies for cBioPortal, streamlining data management for cancer research.
Title: Varan: a tool for managing mutational data and creating cancer studies in cBioPortal. Journal: NAR Genomics and Bioinformatics, vol. 7, issue 3, article lqaf107. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12576050/pdf/
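To make the metadata file creation step mentioned in the abstract above more concrete, here is a minimal sketch that writes a cBioPortal study metadata file. The four field names follow cBioPortal's documented `meta_study.txt` format; the function name, the study identifier, and all example values are hypothetical, and this is not Varan's own interface.

```python
from pathlib import Path

# Fields that cBioPortal's file-format documentation lists as required
# in meta_study.txt.
REQUIRED_FIELDS = (
    "type_of_cancer",
    "cancer_study_identifier",
    "name",
    "description",
)


def write_meta_study(study_dir, fields):
    """Write a minimal meta_study.txt (hypothetical helper, not part of Varan).

    Raises ValueError if any required field is missing or empty.
    """
    missing = [key for key in REQUIRED_FIELDS if not fields.get(key)]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    path = Path(study_dir) / "meta_study.txt"
    # cBioPortal meta files are plain "key: value" lines.
    lines = [f"{key}: {value}" for key, value in fields.items()]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


# Illustrative values only; a real study would use its own identifiers.
EXAMPLE_FIELDS = {
    "type_of_cancer": "brca",
    "cancer_study_identifier": "demo_study_2025",
    "name": "Demo breast cancer study",
    "description": "Hypothetical study used to illustrate the file layout",
}
```

Calling `write_meta_study(some_directory, EXAMPLE_FIELDS)` produces a four-line file. Failing early on a missing field mirrors the kind of quality check the abstract describes, although how Varan itself validates input is not stated here.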
Pub Date: 2025-06-26; eCollection Date: 2025-06-01; DOI: 10.1093/nargab/lqaf085
Lorenzo Colombini, Francesco Santoro, Mariana Tirziu, Anna Maria Cuppone, Gianni Pozzi, Francesco Iannelli
Long inverted repeats (LIRs) of DNA sequences longer than 30 kb are rare in prokaryotes. Here, we identified two 69.9-kb LIRs in the genome of Lactobacillus crispatus M247_Siena, a derivative of strain M247. The complete genome sequence of M247_Siena was determined using Nanopore and Illumina technologies, while genome structure was analyzed using ultra-long Nanopore read mapping and polymerase chain reaction (PCR). In the parental M247 genome, there was only one copy of the 69.9-kb segment, while a 15.4-kb DNA segment was present instead of the second 69.9-kb segment copy. Both segments were delimited by the same insertion sequences (IS1201 and ISLcr2), and PCR analysis of the M247 population revealed low rates (∼1.28 per 10^4 chromosomes) of chromosomal rearrangements involving these regions. In contrast, the 69.9-kb LIRs in M247_Siena increased genomic instability, as evidenced by two alternative chromosomal structures detected at frequencies of 23.3% and 76.7% (∼1 out of 5 chromosomes). Comparative analysis of L. crispatus genomes revealed no LIRs similar to those of M247_Siena. However, long repeats of other DNA segments and chromosomal rearrangements, mostly associated with insertion sequences, were detected in 8 and 9 out of 25 L. crispatus genomes, respectively, highlighting genomic instability as a trait of the species.
Title: A 69.9-kb long inverted repeat increases genome instability in a strain of Lactobacillus crispatus. Journal: NAR Genomics and Bioinformatics, vol. 7, issue 2, article lqaf085. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12199158/pdf/
Pub Date: 2025-06-26; eCollection Date: 2025-06-01; DOI: 10.1093/nargab/lqaf088
Yaxin Zhang, Qiqin Wu, Ying Zhou, Qingyu Cheng, Tengchuan Jin
With the rapid evolution of next-generation sequencing technology, numerous tools have emerged for each stage of human genome analysis, complicating the assembly of an appropriate pipeline. To address this challenge, there is a pressing need for an efficient and user-friendly tool that combines extensive features with intuitive operation. Here we introduce hDNApipe, a highly flexible end-to-end pipeline tool designed for the analysis and interpretation of human genomic sequencing data. It is developed using bash scripts and Tkinter, the standard Python graphical user interface library, which makes it usable and accessible. The pipeline directly obtains variants and associated information, and optionally enables variant visualization and downstream analysis. hDNApipe offers both a command-line interface and a graphical user interface, and provides multiple parameter options for customized analysis. Installation is handled by a dedicated Docker setup, which removes the need to install dependencies manually. It has been tested on a Linux server using publicly available data. Furthermore, hDNApipe was benchmarked against other available pipelines from alignment to variant calling, and performed well in terms of time consumption, precision, and sensitivity.
Title: hDNApipe: streamlining human genome analysis and interpretation with an intuitive and user-friendly interface. Journal: NAR Genomics and Bioinformatics, vol. 7, issue 2, article lqaf088. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12199140/pdf/
Pub Date: 2025-06-23; eCollection Date: 2025-06-01; DOI: 10.1093/nargab/lqaf087
Willem T K Maassen, Lennart F Johansson, Bart Charbon, Dennis Hendriksen, Sander van den Hoek, Mariska K Slofstra, Renée Mulder, Martine T Meems-Veldhuis, Robert Sietsma, Henny H Lemmink, Cleo C van Diemen, Mariëlle E van Gijn, Morris A Swertz, Kasper J van der Velde
Achieving high yield in genetics research and genome diagnostics is a significant challenge because it requires a combination of multiple strategies and large-scale genomic analysis using the latest methods. Existing diagnostic software infrastructures are often unable to cope with high demands for versatility and scalability. We developed MOLGENIS VIP, a flexible, scalable, high-throughput, open-source, and "end-to-end" pipeline to process different types of sequencing data into portable, prioritized variant lists for immediate clinical interpretation in a wide variety of scenarios. VIP supports interpretation of short- and long-read sequencing data, using best-practice annotations and classification trees without complex IT infrastructures. VIP is developed within the long-living MOLGENIS open-source project to provide sustainability and has integrated feedback from a growing international community of users. VIP has undergone genome diagnostic laboratory testing and harnesses experiences from multiple Dutch, European, Canadian, and African diagnostic and infrastructural initiatives (VKGL, EU-Solve-RD, EJP-RD, CINECA, GA4GH). We provide a step-by-step protocol for installing and using VIP. We demonstrate VIP using 25 664 previously classified variants from the VKGL, and 18 and 41 diagnosed patients from a routine diagnostics and a Solve-RD research cohort, respectively. We believe that VIP accelerates causal variant detection and innovation in genome diagnostics and research.
Title: MOLGENIS VIP: an end-to-end DNA variant interpretation pipeline for research and diagnostics configurable to support rapid implementation of new methods. Journal: NAR Genomics and Bioinformatics, vol. 7, issue 2, article lqaf087. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12205968/pdf/
Mollicutes, known as the simplest bacteria with streamlined genomes, were traditionally thought to evolve mainly through gene loss. Recent studies have highlighted their rapid evolutionary capabilities and genetic exchange within individual genomes; however, their evolutionary trajectory remains elusive. By comprehensively screening 1433 available Mollicutes genomes, we revealed widespread horizontal gene transfer (HGT) in 83.9% of investigated species. The transferred genes involve type IV secretion systems and DNA integration, suggesting a unique role for integrative conjugative elements (ICEs) or integrative and mobilizable elements (IMEs) as self-transmissible genetic elements. We systematically identified 263 ICEs/IMEs, either intact or fragmented, across most Mollicutes genera, and their presence correlated strongly with HGT frequency (cor 0.573, P = .002). Their tendency to transfer was most evident among species sharing ecological niches, notably livestock-associated mycoplasmas and insect-vectored spiroplasmas. ICEs/IMEs not only act as gene shuttles ferrying various phenotypic genes, but also promote large-scale chromosomal transfer events, profoundly shaping the host genomes. Additionally, we provide novel evidence that a Ureaplasma ICE facilitates genetic exchange and the spread of the antibiotic resistance gene tet(M) among other pathogens. These findings suggest that, despite the gene-loss pressure associated with the compact genomes of Mollicutes, ICEs/IMEs play a crucial role by introducing substantial genetic resources, providing essential opportunities for evolutionary adaptation.
Title: Comprehensive profiling of integrative conjugative elements (ICEs) in Mollicutes: distinct catalysts of gene flow and genome shaping. Authors: Zili Chai, Zhiyun Guo, Xinxin Chen, Zilong Yang, Xia Wang, Fengwei Zhang, Fuqiang Kang, Wenting Liu, Shuang Liang, Hongguang Ren, Junjie Yue, Yuan Jin. Pub Date: 2025-06-23; DOI: 10.1093/nargab/lqaf083. Journal: NAR Genomics and Bioinformatics, vol. 7, issue 2, article lqaf083. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12205969/pdf/
Classifying non-coding RNA (ncRNA) sequences, particularly mirtrons, is essential for elucidating gene regulation mechanisms. However, the prevalent class imbalance in ncRNA datasets presents significant challenges, often resulting in overfitting and diminished generalization in machine learning models. In this study, we propose GENNUS (GENerative approaches for NUcleotide Sequences), which introduces data augmentation strategies based on generative adversarial networks (GANs) and the synthetic minority over-sampling technique (SMOTE) to enhance mirtron and canonical microRNA (miRNA) classification performance. Our GAN-based methods generate high-quality synthetic data that capture the intricate patterns and diversity of real mirtron sequences, eliminating the need for extensive feature engineering. Across four experiments, models trained on a combination of real and GAN-generated data achieved higher classification accuracy than models trained with traditional SMOTE augmentation or on real data alone. Our findings indicate that GANs enhance model performance and provide a richer representation of minority classes, thus improving generalization across various machine learning frameworks. This work highlights the potential of synthetic data generation for addressing data limitations in genomics, offering a pathway to more effective and scalable mirtron and canonical miRNA classification. GENNUS is available at https://github.com/chiquitto/GENNUS and https://doi.org/10.6084/m9.figshare.28207328.
Title: GENNUS: generative approaches for nucleotide sequences enhance mirtron classification. Authors: Alisson Gaspar Chiquitto, Liliane Santana Oliveira, Pedro Henrique Bugatti, Priscila Tiemi Maeda Saito, Mark Basham, Roberto Tadeu Raittz, Alexandre Rossi Paschoal. Pub Date: 2025-06-20; DOI: 10.1093/nargab/lqaf072. Journal: NAR Genomics and Bioinformatics, vol. 7, issue 2, article lqaf072. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12204755/pdf/
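For readers unfamiliar with the SMOTE baseline named in the GENNUS abstract above, the following is a minimal sketch of the generic SMOTE interpolation step: a synthetic sample is placed at a random point on the segment between a minority-class sample and one of its nearest minority-class neighbours. It assumes sequences have already been encoded as numeric feature vectors; the function name, parameters, and toy data are hypothetical, and this is not the GENNUS implementation.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic samples from minority-class feature vectors.

    Each synthetic sample is base + gap * (neighbour - base), where neighbour
    is one of the k nearest minority samples to base and gap is in [0, 1).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the chosen one.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies on a segment between two real minority samples, SMOTE cannot produce patterns outside what those samples already span. That limitation is one common motivation for trying generative models such as GANs instead.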
Pub Date: 2025-06-20; eCollection Date: 2025-06-01; DOI: 10.1093/nargab/lqaf032
Vasilis F Ntasis, Roderic Guigó
The precise coordination of important biological processes, such as differentiation and development, relies heavily on the regulation of gene expression. In eukaryotic cells, understanding the distribution of RNA transcripts between the nucleus and cytosol is essential for gaining insight into the process of protein production. The most efficient way to estimate the levels of RNA species genome-wide is through RNA sequencing (RNAseq). While RNAseq can be performed separately in the nucleus and in the cytosol, comparing transcript levels between compartments is challenging because measurements are relative to the unknown total RNA volume of each compartment. Here, we show theoretically that if whole-cell RNAseq is performed in addition to nuclear and cytosolic RNAseq, the localization of transcripts can be estimated accurately. Based on this, we designed a method that first estimates the fraction of the total RNA volume in the cytosol (nucleus), and then estimates this fraction for every transcript. We evaluate our methodology on simulated data and on available nuclear and cytosolic single-cell data. Finally, we use our method to investigate the subcellular localization of transcripts using bulk RNAseq data from the ENCODE project.
Title: Studying relative RNA localization from nucleus to the cytosol. Journal: NAR Genomics and Bioinformatics, vol. 7, issue 2, article lqaf032. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12204760/pdf/
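The two-step estimate described in the RNA localization abstract above can be illustrated with a simple mixture model. This is a sketch under an assumption that is not taken from the paper: for each transcript i, the whole-cell relative abundance is modeled as w_i = β·c_i + (1 − β)·n_i, where c_i and n_i are the cytosolic and nuclear relative abundances and β is the fraction of total RNA located in the cytosol. The function names and the least-squares fit are illustrative choices, not the authors' estimator.

```python
def estimate_cytosolic_fraction(whole, nuclear, cytosolic):
    """Least-squares estimate of beta in w_i = beta*c_i + (1 - beta)*n_i.

    Rearranged, w_i - n_i = beta * (c_i - n_i), so beta is the slope of a
    regression through the origin. The result is clipped to [0, 1].
    """
    num = sum((w - n) * (c - n) for w, n, c in zip(whole, nuclear, cytosolic))
    den = sum((c - n) ** 2 for n, c in zip(nuclear, cytosolic))
    if den == 0:
        raise ValueError("nuclear and cytosolic profiles are identical")
    return min(1.0, max(0.0, num / den))


def transcript_cytosolic_fractions(beta, nuclear, cytosolic):
    """Per-transcript share of molecules in the cytosol, given beta."""
    fractions = []
    for n, c in zip(nuclear, cytosolic):
        total = beta * c + (1 - beta) * n
        fractions.append(beta * c / total if total > 0 else float("nan"))
    return fractions
```

The sketch shows why the third measurement matters: nuclear and cytosolic profiles alone are each normalized to their own compartment, so β cannot be identified from them, whereas the whole-cell profile ties the two scales together.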
Pub Date: 2025-06-19; eCollection Date: 2025-06-01; DOI: 10.1093/nargab/lqaf084
Giulia Cesaro, Giacomo Baruzzo, Gaia Tussardi, Barbara Di Camillo
Single-cell transcriptomics data have been widely used to characterize biological systems, particularly in studying cell-cell communication, which plays a significant role in many biological processes. Despite the availability of various computational tools for inferring cellular communication, quantifying variations across different experimental conditions at both intercellular and intracellular levels remains challenging. Moreover, available methods are in general limited in terms of flexibility in analyzing different experimental designs and the ability to visualize results in an easily interpretable way. Here, we present a generalizable computational framework designed to infer and support differential cellular communication analysis across two experimental conditions from large-scale single-cell transcriptomics data. The scSeqCommDiff tool employs a statistical and network-based computational approach for characterizing altered cellular cross-talk in a fast and memory-efficient way. The framework is complemented with CClens, a user-friendly Shiny app to facilitate interactive analysis of inferred cell-cell communication. Validation through spatial transcriptomics data, comparison with other tools, and application to large-scale datasets (including a cell atlas) confirms the reliability, scalability, and efficiency of the framework. Moreover, the application to a single-nucleus transcriptomics dataset shows the validity and ability of the proposed workflow to support and unravel alterations in cell-cell interactions among patients with amyotrophic lateral sclerosis and healthy subjects.
Title: Differential cellular communication inference framework for large-scale single-cell RNA-sequencing data. Journal: NAR Genomics and Bioinformatics, vol. 7, issue 2, article lqaf084. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12204404/pdf/
Pub Date: 2025-06-19 eCollection Date: 2025-06-01 DOI: 10.1093/nargab/lqaf090
Stilianos Louca
The relationship between gene content differences and microbial taxonomic divergence remains poorly understood, and algorithms for delineating novel microbial taxa above genus level based on multiple genome similarity metrics are lacking. Addressing these gaps is important for macroevolutionary theory, biodiversity assessments, and discovery of novel taxa in metagenomes. Here, I develop machine learning classifier models, based on multiple genome similarity metrics, to determine whether any two marine bacterial and archaeal (prokaryotic) metagenome-assembled genomes (MAGs) belong to the same taxon, from the genus up to the phylum levels. Metrics include average amino acid and nucleotide identities, and fractions of shared genes within various categories, applied to 14 390 previously published non-redundant MAGs. At all taxonomic levels, the balanced accuracy (average of the true-positive and true-negative rate) of classifiers exceeded 92%, suggesting that simple genome similarity metrics serve as good taxon differentiators. Predictor selection and sensitivity analyses revealed gene categories, e.g. those involved in metabolism of cofactors and vitamins, particularly correlated to taxon divergence. Predicted taxon delineations were further used to de novo enumerate marine prokaryotic taxa. Statistical analyses of those enumerations suggest that over half of extant marine prokaryotic phyla, classes, and orders have already been recovered by genome-resolved metagenomic surveys.
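The balanced accuracy mentioned in the abstract above (the average of the true-positive and true-negative rates) can be illustrated with a minimal Python sketch. The function name and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Average of the true-positive rate and the true-negative rate."""
    tpr = tp / (tp + fn)  # fraction of same-taxon pairs correctly recognized
    tnr = tn / (tn + fp)  # fraction of different-taxon pairs correctly recognized
    return (tpr + tnr) / 2


def plain_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Fraction of all pairs classified correctly, for comparison."""
    return (tp + tn) / (tp + fn + tn + fp)


if __name__ == "__main__":
    # Hypothetical counts with many more negative (different-taxon) pairs
    # than positive (same-taxon) pairs.
    counts = dict(tp=90, fn=10, tn=950, fp=50)
    print(f"balanced accuracy: {balanced_accuracy(**counts):.3f}")
    print(f"plain accuracy:    {plain_accuracy(**counts):.3f}")
```

Because the two rates are weighted equally, balanced accuracy is not dominated by the larger class, which matters when most genome pairs belong to different taxa.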
Title: Machine learning models for delineating marine microbial taxa. Authors: Stilianos Louca. Journal: NAR Genomics and Bioinformatics, volume 7, issue 2, article lqaf090. Publication date: 2025-06-19. DOI: 10.1093/nargab/lqaf090. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12204397/pdf/
Pub Date: 2025-06-19 eCollection Date: 2025-06-01 DOI: 10.1093/nargab/lqaf081
Miroslav Nuriddinov, Polina Belokopytova, Veniamin Fishman
Identifying structural variants (SVs) remains a pivotal challenge in genomic studies. Chromosome conformation capture (3C) techniques have recently emerged as a promising avenue for the accurate identification of SVs. However, developing and validating computational methods that leverage 3C data requires comprehensive datasets of well-characterized chromosomal rearrangements, which are presently lacking. In this study, we introduce Charm (https://github.com/genomech/Charm), a robust computational framework tailored for Hi-C data simulation. Our findings demonstrate Charm's efficacy in benchmarking both novel and established tools for SV detection. Additionally, we provide an extensive dataset of simulated Hi-C maps, paving the way for subsequent benchmarking efforts.
Title: Charm is a flexible pipeline to simulate chromosomal rearrangements on Hi-C-like data. Authors: Miroslav Nuriddinov, Polina Belokopytova, Veniamin Fishman. Journal: NAR Genomics and Bioinformatics, volume 7, issue 2, article lqaf081. Publication date: 2025-06-19. DOI: 10.1093/nargab/lqaf081. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12204402/pdf/