Basanta Bista, Laura González-Rodelas, Lucía Álvarez-González, Zhi-qiang Wu, Eugenia E. Montiel, Ling Sze Lee, Daleen B. Badenhorst, Srihari Radhakrishnan, Robert Literman, Beatriz Navarro-Dominguez, John B. Iverson, Simon Orozco-Arias, Josefa González, Aurora Ruiz-Herrera, Nicole Valenzuela
Understanding the evolution of chromatin conformation among species is fundamental to elucidate the architecture and plasticity of genomes. Nonrandom interactions of linearly distant loci regulate gene function in species-specific patterns, affecting genome function, evolution, and, ultimately, speciation. Yet, data from nonmodel organisms are scarce. To capture the macroevolutionary diversity of vertebrate chromatin conformation, here we generate de novo genome assemblies for two cryptodiran (hidden-neck) turtles via Illumina sequencing, chromosome conformation capture, and RNA-seq: Apalone spinifera (ZZ/ZW, 2n = 66) and Staurotypus triporcatus (XX/XY, 2n = 54). We detected differences in the three-dimensional (3D) chromatin structure in turtles compared to other amniotes beyond the fusion/fission events detected in the linear genomes. Namely, whole-genome comparisons revealed distinct trends of chromosome rearrangements in turtles: (1) a low rate of genome reshuffling in Apalone (Trionychidae) whose karyotype is highly conserved when compared to chicken (likely ancestral for turtles), and (2) a moderate rate of fusions/fissions in Staurotypus (Kinosternidae) and Trachemys scripta (Emydidae). Furthermore, we identified a chromosome folding pattern that enables “centromere–telomere interactions” previously undetected in turtles. The combined turtle pattern of “centromere–telomere interactions” (discovered here) plus “centromere clustering” (previously reported in sauropsids) is novel for amniotes and it counters previous hypotheses about amniote 3D chromatin structure. We hypothesize that the divergent pattern found in turtles originated from an amniote ancestral state defined by a nuclear configuration with extensive associations among microchromosomes that were preserved upon the reshuffling of the linear genome.
De novo genome assemblies of two cryptodiran turtles with ZZ/ZW and XX/XY sex chromosomes provide insights into patterns of genome reshuffling and uncover novel 3D genome folding in amniotes. Genome Research, published 2024-10-16. https://doi.org/10.1101/gr.279443.124
Guy Kelman, Roei Zucker, Nadav Brandes, Michal Linial
PWAS (proteome-wide association study) is an innovative genetic association approach that complements widely used methods like GWAS (genome-wide association study). The PWAS approach involves three consecutive phases. First, machine learning modeling and probabilistic considerations quantify the impact of genetic variants on the biochemical functions of protein-coding genes. Second, for each individual, the variants are aggregated per gene to determine a gene-damaging score. Finally, standard statistical tests are applied in the case-control setting to identify the genes significantly associated with each phenotype. The PWAS Hub offers a user-friendly interface for an in-depth exploration of gene–disease associations from the UK Biobank (UKB). Results from PWAS cover 99 common diseases and conditions, each with over 10,000 diagnosed individuals. Users can explore genes associated with these diseases, with separate analyses conducted for males and females. For each phenotype, the analyses account for sex-based genetic effects, inheritance modes (dominant and recessive), and the pleiotropic nature of associated genes. We showcase the usefulness of the PWAS Hub for asthma by navigating through proteomic-genetic analyses. Inspecting the 27 PWAS asthma-listed genes provides insights into the underlying cellular and molecular mechanisms. Comparison of statistically significant PWAS genes for common diseases to the Open Targets benchmark shows partial but significant overlap in gene associations for most phenotypes. Graphical tools facilitate comparing genetic effects between PWAS and coding GWAS results, aiding in understanding the sex-specific genetic impact on common diseases. This adaptable platform is attractive to clinicians, researchers, and individuals interested in delving into gene–disease associations and sex-specific genetic effects.
PWAS Hub for exploring gene-based associations of common complex diseases. Genome Research, published 2024-10-15. https://doi.org/10.1101/gr.278916.123
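The aggregation and testing phases described in the PWAS abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the per-variant effect scores (1.0 meaning function fully preserved, 0.0 meaning function lost), the product-based aggregation rule, and the function names `gene_score` and `mean_score_difference` are hypothetical and are not taken from PWAS or the PWAS Hub.

```python
from math import prod
from statistics import mean

def gene_score(variant_effects, dosages):
    """Toy per-gene score for one individual (hypothetical rule, not the PWAS formula).

    variant_effects: predicted function-preservation score per variant, in [0, 1].
    dosages: number of alternate alleles the individual carries per variant (0, 1, 2).
    A variant the individual does not carry (dosage 0) contributes a factor of 1.
    """
    return prod(e ** d for e, d in zip(variant_effects, dosages))

def mean_score_difference(case_scores, control_scores):
    """Difference in mean gene score between cases and controls.

    A negative value means cases carry, on average, a more damaged gene.
    A real analysis would follow this with a proper statistical test.
    """
    return mean(case_scores) - mean(control_scores)

# Two variants in one hypothetical gene.
effects = [0.5, 0.8]
cases = [gene_score(effects, d) for d in ([2, 1], [1, 0])]
controls = [gene_score(effects, d) for d in ([0, 0], [1, 0])]
delta = mean_score_difference(cases, controls)
```

The sketch only shows the shape of the pipeline (variant effects, then a per-individual gene score, then a case-control comparison); the actual scoring model and tests are those described in the PWAS publications.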
Romain Derelle, Johanna von Wachsmann, Tommi Mäklin, Joel Hellewell, Timothy Russell, Ajit Lalvani, Leonid Chindelevitch, Nicholas J. Croucher, Simon R. Harris, John A. Lees
Sequence variation observed in populations of pathogens can be used for important public health and evolutionary genomic analyses, especially outbreak analysis and transmission reconstruction. Identifying this variation is typically achieved by aligning sequence reads to a reference genome, but this approach is susceptible to reference biases and requires careful filtering of called genotypes. There is a need for tools that can process this growing volume of bacterial genome data, providing rapid results, but that remain simple so they can be used without highly trained bioinformaticians, expensive data analysis, and long-term storage and processing of large files. Here we describe split k-mer analysis (SKA2), a method that supports both reference-free and reference-based mapping to quickly and accurately genotype populations of bacteria using sequencing reads or genome assemblies. SKA2 is highly accurate for closely related samples, and in outbreak simulations, we show superior variant recall compared with reference-based methods, with no false positives. SKA2 can also accurately map variants to a reference and be used with recombination detection methods to rapidly reconstruct vertical evolutionary history. SKA2 is many times faster than comparable methods and can be used to add new genomes to an existing call set, allowing sequential use without the need to reanalyze entire collections. With an inherent absence of reference bias, high accuracy, and a robust implementation, SKA2 has the potential to become the tool of choice for genotyping bacteria. SKA2 is implemented in Rust and is freely available as open-source software.
Seamless, rapid, and accurate analyses of outbreak genomic data using split k-mer analysis. Genome Research, published 2024-10-15. https://doi.org/10.1101/gr.279449.124
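The split k-mer idea behind the SKA2 abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SKA2's implementation: a split k-mer is taken to be a pair of flanking sequences around one middle base, and two samples that share the flanks but differ in the middle base are reported as a candidate SNP. The function names and the `half` parameter are hypothetical, and the sketch ignores reverse complements, ambiguous bases, and read error filtering.

```python
def split_kmers(seq, half=3):
    """Map each (left flank, right flank) pair to the set of middle bases seen."""
    table = {}
    width = 2 * half + 1
    for i in range(len(seq) - width + 1):
        left = seq[i:i + half]
        middle = seq[i + half]
        right = seq[i + half + 1:i + width]
        table.setdefault((left, right), set()).add(middle)
    return table

def candidate_snps(seq_a, seq_b, half=3):
    """Report flank pairs shared by both sequences whose single middle base differs."""
    ka, kb = split_kmers(seq_a, half), split_kmers(seq_b, half)
    found = []
    for key in ka.keys() & kb.keys():
        # Skip flanks with more than one middle base (ambiguous within a sample).
        if len(ka[key]) == 1 and len(kb[key]) == 1 and ka[key] != kb[key]:
            found.append((key, next(iter(ka[key])), next(iter(kb[key]))))
    return sorted(found)

sample_a = "ACGTTGCATGGA"
sample_b = "ACGTTGAATGGA"  # differs from sample_a at one position (C -> A)
snps = candidate_snps(sample_a, sample_b)
```

No reference genome is involved in the comparison, which is why this kind of matching avoids reference bias for closely related samples.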
Automated telomere-to-telomere (T2T) de novo assembly of diploid and polyploid genomes remains a formidable task. A string graph is a commonly used assembly graph representation in the assembly algorithms. The string graph formulation employs graph simplification heuristics, which drastically reduce the count of vertices and edges. One of these heuristics involves removing the reads contained in longer reads. In practice, this heuristic occasionally introduces gaps in the assembly by removing all reads that cover one or more genome intervals. The factors contributing to such gaps remain poorly understood. In this work, we mathematically derived the frequency of observing a gap near a germline and a somatic heterozygous variant locus. Our analysis shows that (i) an assembly gap due to contained read deletion is an order of magnitude more frequent in Oxford Nanopore reads than PacBio HiFi reads due to differences in their read-length distributions, and (ii) this frequency decreases with an increase in the sequencing depth. Drawing cues from these observations, we addressed the weakness of the string graph formulation by developing the RAFT assembly algorithm. RAFT addresses the issue of contained reads by fragmenting reads and producing a more uniform read-length distribution. The algorithm retains spanned repeats in the reads during the fragmentation. We empirically demonstrate that RAFT significantly reduces the number of gaps using simulated datasets. Using real Oxford Nanopore and PacBio HiFi datasets of the HG002 human genome, we achieved a twofold increase in the contig NG50 and the number of haplotype-resolved T2T contigs compared to Hifiasm.
Sudhanva Shyam Kamath, Mehak Bindra, Debnath Pal, Chirag Jain. Telomere-to-telomere assembly by preserving contained reads. Genome Research, published 2024-10-15. https://doi.org/10.1101/gr.279311.124
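The gap mechanism described in the contained-reads abstract can be reproduced on a toy diploid example. This is a sketch under stated assumptions and is not the RAFT algorithm: reads are error-free substrings of two haplotypes that differ at one heterozygous site, and a read counts as contained when its sequence is a substring of a longer read. The read layout, coordinates, and helper names are hypothetical.

```python
import random

def contained_reads(reads):
    """Names of reads whose sequence is a substring of another, longer read."""
    return {
        name for name, seq in reads.items()
        if any(other != name and len(oseq) > len(seq) and seq in oseq
               for other, oseq in reads.items())
    }

def uncovered(intervals, start, end):
    """Positions in [start, end) not covered by any half-open interval."""
    covered = set()
    for s, e in intervals:
        covered.update(range(s, e))
    return [p for p in range(start, end) if p not in covered]

rng = random.Random(0)
hap1 = "".join(rng.choice("ACGT") for _ in range(150))
VARIANT = 40  # heterozygous site: hap2 carries a different base here
alt = "A" if hap1[VARIANT] != "A" else "C"
haps = {"hap1": hap1, "hap2": hap1[:VARIANT] + alt + hap1[VARIANT + 1:]}

# name: (haplotype of origin, start, end), half-open coordinates
layout = {
    "h1_a": ("hap1", 30, 100),
    "h1_b": ("hap1", 90, 150),
    "h2_a": ("hap2", 0, 50),    # covers the variant
    "h2_b": ("hap2", 45, 70),   # misses the variant, so it matches hap1 exactly
    "h2_c": ("hap2", 65, 120),
}
reads = {name: haps[h][s:e] for name, (h, s, e) in layout.items()}

dropped = contained_reads(reads)
hap2_kept = [(s, e) for name, (h, s, e) in layout.items()
             if h == "hap2" and name not in dropped]
gap = uncovered(hap2_kept, 0, 120)
```

Read `h2_b` is removed because it is a substring of the longer hap1 read `h1_a`, yet it was the only hap2 read linking `h2_a` to `h2_c`. After removal, hap2 has no read support over positions 50 to 64, and `h2_a` cannot overlap `h1_a` exactly because the two disagree at the variant.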
High-throughput sequencing (HTS) technologies have been instrumental in investigating biological questions at the bulk and single-cell levels. Comparative analysis of two HTS data sets often relies on testing the statistical significance for the difference of two negative binomial distributions (DOTNB). Although negative binomial distributions are well studied, the theoretical results for DOTNB remain largely unexplored. Here, we derive basic analytical results for DOTNB and examine its asymptotic properties. As a state-of-the-art application of DOTNB, we introduce DEGage, a computational method for detecting differentially expressed genes (DEGs) in scRNA-seq data. DEGage calculates the mean of the sample-wise differences of gene expression levels as the test statistic and determines significant differential expression by computing the P-value with DOTNB. Extensive validation using simulated and real scRNA-seq data sets demonstrates that DEGage outperforms five popular DEG analysis tools: DEGseq2, DEsingle, edgeR, Monocle3, and scDD. DEGage is robust against high dropout levels and exhibits superior sensitivity when applied to balanced and imbalanced data sets, even with small sample sizes. We utilize DEGage to analyze prostate cancer scRNA-seq data sets and identify marker genes for 17 cell types. Furthermore, we apply DEGage to scRNA-seq data sets of mouse neurons with and without fear memory and reveal eight potential memory-related genes overlooked in previous analyses. The theoretical results and supporting software for DOTNB can be widely applied to comparative analyses of dispersed count data in HTS and broad research questions.
Alicia Petrany, Ruoyu Chen, Shaoqiang Zhang, Yong Chen. Theoretical framework for the difference of two negative binomial distributions and its application in comparative analysis of sequencing data. Genome Research, published 2024-10-15. https://doi.org/10.1101/gr.278843.123
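The distribution of the difference of two independent negative binomial variables can be evaluated numerically by direct convolution, as in the sketch below. This is not the DEGage implementation or the paper's analytical result; it assumes integer size parameters, the parameterization P(X = k) = C(k + r − 1, k) p^r (1 − p)^k, and a truncated sum (`kmax` is a hypothetical cutoff chosen so the neglected tail is negligible for these parameters).

```python
from math import comb

def nb_pmf(k, r, p):
    """P(X = k) for a negative binomial with integer size r and success probability p."""
    if k < 0:
        return 0.0
    return comb(k + r - 1, k) * p ** r * (1 - p) ** k

def diff_nb_pmf(z, r1, p1, r2, p2, kmax=400):
    """P(X - Y = z) for independent X ~ NB(r1, p1), Y ~ NB(r2, p2), by truncated convolution."""
    return sum(nb_pmf(k + z, r1, p1) * nb_pmf(k, r2, p2)
               for k in range(max(0, -z), kmax))

def two_sided_tail(z_obs, r1, p1, r2, p2, zmax=200):
    """P(|X - Y| >= |z_obs|), summed over the truncated support [-zmax, zmax]."""
    return sum(diff_nb_pmf(z, r1, p1, r2, p2)
               for z in range(-zmax, zmax + 1) if abs(z) >= abs(z_obs))

# Example parameters: E[X] = 5 * 0.5 / 0.5 = 5 and E[Y] = 2 * 0.6 / 0.4 = 3.
support = range(-200, 201)
pmf = {z: diff_nb_pmf(z, 5, 0.5, 2, 0.4) for z in support}
total = sum(pmf.values())
mean_diff = sum(z * prob for z, prob in pmf.items())
```

The sanity checks are that the probabilities sum to 1 and that the mean of the difference equals the difference of the two means, r1(1 − p1)/p1 − r2(1 − p2)/p2.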
Somatic mutations arise and accumulate during tissue culture and vegetative propagation, potentially affecting various traits in horticultural crops, but their characteristics are still unclear. Here, somatic mutations in regenerated woodland strawberry derived from tissue culture of shoot tips under different conditions and in 12 cultivated strawberry individuals are analyzed by whole-genome sequencing. The mutation frequency of single nucleotide variants increases significantly with higher hormone levels or prolonged culture time, ranging from 3.3 × 10⁻⁸ to 3.0 × 10⁻⁶ mutations per site. CG methylation shows a stable reduction (0.71%–8.03%) in regenerated plants, and hypoCG-DMRs are more heritable after sexual reproduction. A high-quality haplotype-resolved genome is assembled for the strawberry cultivar “Beni hoppe.” The 12 “Beni hoppe” individuals randomly selected from different locations show 4731–6005 mutations relative to the reference genome, and the mutation frequency varies among the subgenomes. Our study has systematically characterized the genetic and epigenetic variants in regenerated woodland strawberry plants and different individuals of the same strawberry cultivar, providing an accurate assessment of somatic mutations at the genomic scale and nucleotide resolution in plants.
Shaoqiang Hu, Xiangguo Zeng, Yuguo Liu, Yongping Li, Minghao Qu, Wen-Biao Jiao, Yongchao Han, Chunying Kang. Global characterization of somatic mutations and DNA methylation changes during vegetative propagation in strawberries. Genome Research, published 2024-10-15. https://doi.org/10.1101/gr.279378.124
Luis E. Valentin-Alvarado, Ling-Dong Shi, Kathryn E. Appler, Alexander Crits-Christoph, Valerie De Anda, Benjamin A. Adler, Michael L. Cui, Lynn Ly, Pedro Leão, Richard J. Roberts, Rohan Sachdeva, Brett J. Baker, David F. Savage, Jillian F. Banfield
Asgard archaea are of great interest as the progenitors of Eukaryotes, but little is known about the mobile genetic elements (MGEs) that may shape their ongoing evolution. Here, we describe MGEs that replicate in Atabeyarchaeia, a wetland Asgard archaea lineage represented by two complete genomes. We used soil depth–resolved population metagenomic data sets to track 18 MGEs for which genome structures were defined and precise chromosome integration sites could be identified for confident host linkage. Additionally, we identified a complete 20.67 kbp circular plasmid and two family-level groups of viruses linked to Atabeyarchaeia, via CRISPR spacer targeting. Closely related 40 kbp viruses possess a hypervariable genomic region encoding combinations of specific genes for small cysteine-rich proteins structurally similar to restriction-homing endonucleases. One 10.9 kbp integrative conjugative element (ICE) integrates genomically into the Atabeyarchaeum deiterrae-1 chromosome and has a 2.5 kbp circularizable element integrated within it. The 10.9 kbp ICE encodes an expressed Type IIG restriction-modification system with a sequence specificity matching an active methylation motif identified by Pacific Biosciences (PacBio) high-accuracy long-read (HiFi) metagenomic sequencing. Restriction-modification of Atabeyarchaeia differs from that of another coexisting Asgard archaea, Freyarchaeia, which has few identified MGEs but possesses diverse defense mechanisms, including DISARM and Hachiman, not found in Atabeyarchaeia. Overall, defense systems and methylation mechanisms of Asgard archaea likely modulate their interactions with MGEs, and integration/excision and copy number variation of MGEs in turn enable host genetic versatility.
Complete genomes of Asgard archaea reveal diverse integrated and mobile genetic elements. Genome Research, published 2024-10-15. https://doi.org/10.1101/gr.279480.124
Niccolo Tesi, Alex Salazar, Yaran Zhang, Sven van der Lee, Marc Hulsman, Lydian Knoop, Sanduni Wijesekera, Jana Krizova, Anne-Fleur Schneider, Maartje Pennings, Kristel Sleegers, Erik-Jan Kamsteeg, Marcel Reinders, Henne Holstege
Tandem repeats (TRs) play important roles in genomic variation and disease risk in humans. Long-read sequencing allows for the accurate characterization of TRs; however, the underlying bioinformatics remains challenging. We present otter and TREAT: otter is a fast targeted local assembler, cross-compatible across different sequencing platforms. It is integrated in TREAT, an end-to-end workflow for TR characterization, visualization, and analysis across multiple genomes. In a comparison with existing tools based on long-read sequencing data from both Oxford Nanopore Technology (ONT, Simplex and Duplex) and PacBio (Sequel 2 and Revio), otter and TREAT achieved state-of-the-art genotyping and motif characterisation accuracy. Applied to clinically relevant TRs, TREAT/otter significantly identified individuals with pathogenic TR expansions. When applied to a case-control setting, we significantly replicated previously reported associations of TRs with Alzheimer's Disease, including those near or within the APOC1 (p = 2.63 × 10⁻⁹), SPI1 (p = 6.5 × 10⁻³), and ABCA7 (p = 0.04) genes. We used TREAT/otter to systematically evaluate potential biases when genotyping TRs using diverse ONT and PacBio long-read sequencing datasets. We showed that, in rare cases (0.06%), long-read sequencing suffers from coverage drops in TRs, including the disease-associated TRs in the ABCA7 and RFC1 genes. Such coverage drops can lead to TR misgenotyping, hampering the accurate characterization of TR alleles. Taken together, our tools can accurately genotype TRs across different sequencing technologies and with minimal requirements, allowing end-to-end analysis and comparisons of TRs in human genomes, with broad applications in research and clinical fields.
{"title":"Characterising tandem repeat complexities across long-read sequencing platforms with TREAT and otter","authors":"Niccolo Tesi, Alex Salazar, Yaran Zhang, Sven van der Lee, Marc Hulsman, Lydian Knoop, Sanduni Wijesekera, Jana Krizova, Anne-Fleur Schneider, Maartje Pennings, Kristel Sleegers, Erik-Jan Kamsteeg, Marcel Reinders, Henne Holstege","doi":"10.1101/gr.279351.124","DOIUrl":"https://doi.org/10.1101/gr.279351.124","url":null,"PeriodicalId":12678,"journal":{"name":"Genome research","volume":"59 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142439682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haotian Teng, Marcus Stoiber, Ziv Bar-Joseph, Carl Kingsford
Direct nanopore-based RNA sequencing can be used to detect post-transcriptional base modifications, such as m6A methylation, based on the electric current signals produced by the distinct chemical structures of modified bases. A key challenge is the scarcity of adequate training data with known methylation modifications. We present Xron, a hybrid encoder-decoder framework that delivers a direct methylation-distinguishing basecaller by training on synthetic RNA data and immunoprecipitation-based experimental data in two steps. First, we generate data with more diverse modification combinations through in silico cross-linking. Second, we use this dataset to train an end-to-end neural network basecaller, which is then fine-tuned on immunoprecipitation-based experimental data with label smoothing. The trained neural network basecaller outperforms existing methylation detection methods on both read-level and site-level prediction scores. Xron is a standalone, end-to-end m6A-distinguishing basecaller capable of detecting methylated bases directly from raw sequencing signals, enabling de novo methylome assembly.
{"title":"Detecting m6A RNA modification from nanopore sequencing using a semi-supervised learning framework","authors":"Haotian Teng, Marcus Stoiber, Ziv Bar-Joseph, Carl Kingsford","doi":"10.1101/gr.278960.124","DOIUrl":"https://doi.org/10.1101/gr.278960.124","url":null,"PeriodicalId":12678,"journal":{"name":"Genome research","volume":"1 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142439677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kaiyuan Zhu, Matthew G Jones, Jens Luebeck, Xinxin Bu, Hyerim Yi, King L Hung, Ivy Tsz-Lo Wong, Shu Zhang, Paul S Mischel, Howard Y Chang, Vineet Bafna
Extrachromosomal DNA (ecDNA) is a central mechanism for focal oncogene amplification in cancer, occurring in ∼15% of early-stage cancers and ∼30% of late-stage cancers. ecDNAs drive tumor formation, evolution, and drug resistance by dynamically modulating oncogene copy number and rewiring gene-regulatory networks. Elucidating the genomic architecture of ecDNA amplifications is critical for understanding tumor pathology and developing more effective therapies. Paired-end short-read (Illumina) sequencing and mapping have been used to represent ecDNA amplifications as a breakpoint graph, in which the inferred architecture of ecDNA is encoded as a cycle in the graph. Traversals of breakpoint graphs have been used to successfully predict ecDNA presence in cancer samples. However, short-read technologies are intrinsically limited in the identification of breakpoints, the phasing of complex rearrangements and internal duplications, and the deconvolution of cell-to-cell heterogeneity of ecDNA structures. Long-read technologies, such as those from Oxford Nanopore Technologies, have the potential to improve inference because longer reads are better at mapping structural variants and are more likely to span rearranged or duplicated regions. Here, we propose Complete Reconstruction of Amplifications with Long reads (CoRAL) for reconstructing ecDNA architectures using long-read data. CoRAL reconstructs likely cyclic architectures using quadratic programming that simultaneously optimizes parsimony of reconstruction, explained copy number, and consistency of long-read mapping. CoRAL substantially improves reconstructions in extensive simulations and 10 data sets from previously characterized cell lines compared with previous short- and long-read-based tools. As long-read usage becomes widespread, we anticipate that CoRAL will be a valuable tool for profiling the landscape and evolution of focal amplifications in tumors.
{"title":"CoRAL accurately resolves extrachromosomal DNA genome structures with long-read sequencing.","authors":"Kaiyuan Zhu, Matthew G Jones, Jens Luebeck, Xinxin Bu, Hyerim Yi, King L Hung, Ivy Tsz-Lo Wong, Shu Zhang, Paul S Mischel, Howard Y Chang, Vineet Bafna","doi":"10.1101/gr.279131.124","DOIUrl":"10.1101/gr.279131.124","url":null,"PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":"1344-1354"},"PeriodicalIF":6.2,"publicationDate":"2024-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11529860/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141563231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}