Shruti V Iyer, Sara Goodwin, William Richard McCombie
Long-read sequencing technologies have improved the contiguity and, as a result, the quality of genome assemblies by generating reads long enough to span and resolve complex or repetitive regions of the genome. Several groups have shown the power of long reads in detecting thousands of genomic and epigenomic features that were previously missed by short-read sequencing approaches. While these studies demonstrate how long reads can help resolve repetitive and complex regions of the genome, they also highlight the throughput and coverage requirements needed to accurately resolve variant alleles across large populations using these platforms. At the time of this review, whole-genome long-read sequencing is more expensive than short-read sequencing on the highest throughput short-read instruments; thus, achieving sufficient coverage to detect low-frequency variants (such as somatic variation) in heterogeneous samples remains challenging. Targeted sequencing, on the other hand, provides the depth necessary to detect these low-frequency variants in heterogeneous populations. Here, we review currently used and recently developed targeted sequencing strategies that leverage existing long-read technologies to increase the resolution with which we can look at nucleic acids in a variety of biological contexts.
Leveraging the power of long reads for targeted sequencing. Genome Research 34(11):1701-1718. Published 2024-11-20. doi:10.1101/gr.279168.124. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610587/pdf/
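The depth argument in this abstract can be made concrete with a simple binomial model. The sketch below is illustrative only and is not taken from the review: it assumes each read independently carries the variant with probability equal to the variant allele fraction, ignores sequencing error, and uses a hypothetical function name and a hypothetical calling threshold of three supporting reads.

```python
from math import comb


def detection_probability(depth, allele_fraction, min_alt_reads):
    """Probability of seeing at least `min_alt_reads` variant-supporting reads.

    Assumes the number of variant reads is Binomial(depth, allele_fraction),
    i.e. independent reads and no sequencing error (a simplification).
    """
    # Sum P(X = i) for i below the threshold, then take the complement.
    upper = min(min_alt_reads, depth + 1)
    p_fewer = sum(
        comb(depth, i)
        * allele_fraction ** i
        * (1.0 - allele_fraction) ** (depth - i)
        for i in range(upper)
    )
    return 1.0 - p_fewer


# Hypothetical comparison: a 1% variant at whole-genome-like depth
# versus the much higher depth a targeted assay can reach.
for depth in (30, 1000):
    print(depth, detection_probability(depth, 0.01, 3))
```

Under these assumptions the detection probability rises steeply with depth, which is the motivation the abstract gives for targeted approaches.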
Jonas A Gustafson, Sophia B Gibson, Nikhita Damaraju, Miranda P G Zalusky, Kendra Hoekzema, David Twesigomwe, Lei Yang, Anthony A Snead, Phillip A Richmond, Wouter De Coster, Nathan D Olson, Andrea Guarracino, Qiuhui Li, Angela L Miller, Joy Goffena, Zachary B Anderson, Sophie H R Storz, Sydney A Ward, Maisha Sinha, Claudia Gonzaga-Jauregui, Wayne E Clarke, Anna O Basile, André Corvelo, Catherine Reeves, Adrienne Helland, Rajeeva Lochan Musunuri, Mahler Revsine, Karynne E Patterson, Cate R Paschal, Christina Zakarian, Sara Goodwin, Tanner D Jensen, Esther Robb, William Richard McCombie, Fritz J Sedlazeck, Justin M Zook, Stephen B Montgomery, Erik Garrison, Mikhail Kolmogorov, Michael C Schatz, Richard N McLaughlin, Harriet Dashnow, Michael C Zody, Matt Loose, Miten Jain, Evan E Eichler, Danny E Miller
Fewer than half of individuals with a suspected Mendelian or monogenic condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control data sets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project (1KGP) Oxford Nanopore Technologies Sequencing Consortium aims to generate LRS data from at least 800 of the 1KGP samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37× and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
High-coverage nanopore sequencing of samples from the 1000 Genomes Project to build a comprehensive catalog of human genetic variation. Genome Research, pp. 2061-2073. Published 2024-11-20. doi:10.1101/gr.279273.124. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610458/pdf/
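The read N50 quoted in this abstract (54 kbp) is a length-weighted median: the read length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal sketch of that definition follows; the function name and the toy read lengths are illustrative and are not part of the consortium's pipeline.

```python
def read_n50(lengths):
    """Return the N50 of a collection of read lengths (in bases)."""
    if not lengths:
        raise ValueError("need at least one read length")
    total = sum(lengths)
    running = 0
    # Walk from the longest read down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


toy_lengths = [2, 3, 4, 5, 6]  # made-up values, 20 bases in total
print(read_n50(toy_lengths))
```

Because long reads contribute more bases, the N50 is usually larger than the plain median read length, which is why it is the statistic typically reported for long-read runs.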
Angela Gomez-Simmonds, Medini K Annavajhala, Dwayne Seeram, Todd W Hokunson, Heekuk Park, Anne-Catrin Uhlemann
Transmission of carbapenem-resistant Enterobacterales (CRE) in hospitals has been shown to occur through complex, multifarious networks driven by both clonal spread and horizontal transfer mediated by plasmids and other mobile genetic elements. We performed nanopore long-read sequencing on CRE isolates from a large urban hospital system to determine the overall contribution of plasmids to CRE transmission and identify specific plasmids implicated in the spread of blaKPC (the Klebsiella pneumoniae carbapenemase [KPC] gene). Six hundred and five CRE isolates collected between 2009 and 2018 first underwent Illumina sequencing for genome-wide genotyping; 435 blaKPC-positive isolates were then successfully nanopore sequenced to generate hybrid assemblies including circularized blaKPC-harboring plasmids. Phylogenetic analysis and Mash clustering were used to define putative clonal and plasmid transmission clusters, respectively. Overall, CRE isolates belonged to 96 multilocus sequence types (STs) encoding blaKPC on 447 plasmids which formed 54 plasmid clusters. We found evidence for clonal transmission in 66% of CRE isolates, over half of which belonged to four clades comprising K. pneumoniae ST258. Plasmid-mediated acquisition of blaKPC occurred in 23%-27% of isolates. While most plasmid clusters were small, several plasmids were identified in multiple different species and STs, including a highly promiscuous IncN plasmid and an IncF plasmid putatively spreading blaKPC from ST258 to other clones. Overall, this points to both the continued dominance of K. pneumoniae ST258 and the dissemination of blaKPC across clones and species by diverse plasmid backbones. These findings support integrating long-read sequencing into genomic surveillance approaches to detect the hitherto silent spread of carbapenem resistance driven by mobile plasmids.
Genomic epidemiology of carbapenem-resistant Enterobacterales at a New York City hospital over a 10-year period reveals complex plasmid-clone dynamics and evidence for frequent horizontal transfer of blaKPC. Genome Research, pp. 1895-1907. Published 2024-11-20. doi:10.1101/gr.279355.124. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610580/pdf/
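The abstract names Mash clustering as the basis for the plasmid clusters. As a rough illustration of the underlying distance, the sketch below computes the Mash distance formula, D = -(1/k) ln(2j / (1 + j)), from an exact k-mer Jaccard index j. This is a simplification and not the authors' workflow: the Mash tool estimates j from MinHash sketches and uses canonical (strand-independent) k-mers, neither of which is done here, and the function names and default k are assumptions.

```python
import math


def kmers(seq, k):
    """Set of all k-length substrings of seq (no reverse complements)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_distance(seq_a, seq_b, k=21):
    """Mash-style distance from an exact (not sketched) Jaccard index."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a or not b:
        raise ValueError("sequences must be at least k bases long")
    j = len(a & b) / len(a | b)
    if j == 0:
        return 1.0  # no shared k-mers: maximal distance
    d = -math.log(2 * j / (1 + j)) / k
    # Clamp to [0, 1]; this also avoids returning -0.0 for identical inputs.
    return min(1.0, max(0.0, d))


# Toy sequences, far shorter than real plasmids, with a small k.
print(mash_distance("ACGTACGTAGCT", "ACGTACGTTGCA", k=4))
```

Pairwise distances of this kind can then be fed to any standard clustering method; the thresholds used to define a transmission cluster are study-specific and are not given in the abstract.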
Sergey Koren, Zhigui Bao, Andrea Guarracino, Shujun Ou, Sara Goodwin, Katharine M Jenike, Julian Lucas, Brandy McNulty, Jimin Park, Mikko Rautiainen, Arang Rhie, Dick Roelofs, Harrie Schneiders, Ilse Vrijenhoek, Koen Nijbroek, Olle Nordesjo, Sergey Nurk, Mike Vella, Katherine R Lawrence, Doreen Ware, Michael C Schatz, Erik Garrison, Sanwen Huang, William Richard McCombie, Karen H Miga, Alexander H J Wittenberg, Adam M Phillippy
The combination of ultra-long (UL) Oxford Nanopore Technologies (ONT) sequencing reads with long, accurate Pacific Biosciences (PacBio) High Fidelity (HiFi) reads has enabled the completion of a human genome and spurred similar efforts to complete the genomes of many other species. However, this approach for complete, "telomere-to-telomere" genome assembly relies on multiple sequencing platforms, limiting its accessibility. ONT "Duplex" sequencing reads, where both strands of the DNA are read to improve quality, promise high per-base accuracy. To evaluate this new data type, we generated ONT Duplex data for three widely studied genomes: human HG002, Solanum lycopersicum Heinz 1706 (tomato), and Zea mays B73 (maize). For the diploid, heterozygous HG002 genome, we also used "Pore-C" chromatin contact mapping to completely phase the haplotypes. We found the accuracy of Duplex data to be similar to HiFi sequencing, but with read lengths tens of kilobases longer, and the Pore-C data to be compatible with existing diploid assembly algorithms. This combination of read length and accuracy enables the construction of a high-quality initial assembly, which can then be further resolved using the UL reads, and finally phased into chromosome-scale haplotypes with Pore-C. The resulting assemblies have a base accuracy exceeding 99.999% (Q50) and near-perfect continuity, with most chromosomes assembled as single contigs. We conclude that ONT sequencing is a viable alternative to HiFi sequencing for de novo genome assembly, and provides a multirun single-instrument solution for the reconstruction of complete genomes.
Gapless assembly of complete human and plant chromosomes using only nanopore sequencing. Genome Research, pp. 1919-1930. Published 2024-11-20. doi:10.1101/gr.279334.124. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610574/pdf/
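The pairing of "99.999%" with "Q50" in this abstract follows from the Phred scale, Q = -10 log10(p), where p is the per-base error probability. A short sketch of the conversion in both directions is given below; the function names are illustrative.

```python
import math


def phred_from_accuracy(accuracy):
    """Phred-scaled quality for a per-base accuracy in [0, 1)."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    error = 1.0 - accuracy
    return -10.0 * math.log10(error)


def accuracy_from_phred(q):
    """Per-base accuracy implied by a Phred-scaled quality value."""
    return 1.0 - 10.0 ** (-q / 10.0)


print(round(phred_from_accuracy(0.99999), 3))  # accuracy from the abstract
print(accuracy_from_phred(50))
```

Each additional 10 Q units corresponds to a tenfold lower error rate, so Q50 means roughly one error per 100,000 bases.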
Alexander J Ritter, Jolene M Draper, Christopher Vollmers, Jeremy R Sanford
Alternative splicing (AS) alters the cis-regulatory landscape of mRNA isoforms, leading to transcripts with distinct localization, stability, and translational efficiency. To rigorously investigate mRNA isoform-specific ribosome association, we generated subcellular fractionation and sequencing (Frac-seq) libraries using both conventional short reads and long reads from human embryonic stem cells (ESCs) and neural progenitor cells (NPCs) derived from the same ESCs. We performed de novo transcriptome assembly from high-confidence long reads from cytosolic, monosomal, light, and heavy polyribosomal fractions and quantified their abundance using short reads from their respective subcellular fractions. Thousands of transcripts in each cell type exhibited association with particular subcellular fractions relative to the cytosol. Of the multi-isoform genes, 27% and 19% exhibited significant differential isoform sedimentation in ESCs and NPCs, respectively. Alternative promoter usage and internal exon skipping accounted for the majority of differences between isoforms from the same gene. Random forest classifiers implicated coding sequence (CDS) and untranslated region (UTR) lengths as important determinants of isoform-specific sedimentation profiles, and motif analyses reveal potential cell type-specific and subcellular fraction-associated RNA-binding protein signatures. Taken together, our data demonstrate that alternative mRNA processing within the CDS and UTRs impacts the translational control of mRNA isoforms during stem cell differentiation, and highlight the utility of using a novel long-read sequencing-based method to study translational control.
Long-read subcellular fractionation and sequencing reveals the translational fate of full-length mRNA isoforms during neuronal differentiation. Genome Research, pp. 2000-2011. Published 2024-11-20. doi:10.1101/gr.279170.124. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610577/pdf/
Paige A Byerly, Alina von Thaden, Evgeny Leushkin, Leon Hilgers, Shenglin Liu, Sven Winter, Tilman Schell, Charlotte Gerheim, Alexander Ben Hamadou, Carola Greve, Christian Betz, Hanno J Bolz, Sven Büchner, Johannes Lang, Holger Meinig, Evax Marie Famira-Parcsetich, Sarah P Stubbe, Alice Mouton, Sandro Bertolino, Goedele Verbeylen, Thomas Briner, Lídia Freixas, Lorenzo Vinciguerra, Sarah A Mueller, Carsten Nowak, Michael Hiller
Genomic resources are important for evaluating genetic diversity and supporting conservation efforts. The garden dormouse (Eliomys quercinus) is a small rodent that has experienced one of the most severe modern population declines in Europe. We present a high-quality haplotype-resolved reference genome for the garden dormouse, and combine comprehensive short- and long-read transcriptomics data sets with homology-based methods to generate a highly complete gene annotation. Demographic history analysis of the genome reveals a sharp population decline since the last interglacial, indicating an association between colder climates and population declines before anthropogenic influence. Using our genome and genetic data from 100 individuals, largely sampled in a citizen-science project across the contemporary range, we conduct the first population genomic analysis for this species. We find clear evidence for population structure across the species' core Central European range. Notably, our data show that the Alpine population, characterized by strong differentiation and reduced genetic diversity, is reproductively isolated from other regions and likely represents a differentiated evolutionarily significant unit (ESU). The predominantly declining Eastern European populations also show signs of recent isolation, a pattern consistent with a range expansion from Western to Eastern Europe during the Holocene, leaving relict populations now facing local extinction. Overall, our findings suggest that garden dormouse conservation may be enhanced in Europe through the designation of ESUs.
Haplotype-resolved genome and population genomics of the threatened garden dormouse in Europe. Genome Research, pp. 2094-2107. Published 2024-11-20. doi:10.1101/gr.279066.124. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610594/pdf/
Ezequiel Calvo-Roitberg, Rachel F Daniels, Athma A Pai
Long-read sequencing (LRS) technologies have the potential to revolutionize scientific discoveries in RNA biology through the comprehensive identification and quantification of full-length mRNA isoforms. Despite great promise, challenges remain in the widespread implementation of LRS technologies for RNA-based applications, including concerns about low coverage, high sequencing error, and a lack of robust computational pipelines. Although much focus has been placed on defining mRNA exon composition and structure with LRS data, less careful characterization has been done of the ability to assess the terminal ends of isoforms, specifically, transcription start and end sites. Such characterization is crucial for completely delineating full mRNA molecules and regulatory consequences. However, there are substantial inconsistencies in both start and end coordinates of LRS reads spanning a gene, such that LRS reads often fail to accurately recapitulate annotated or empirically derived terminal ends of mRNA molecules. Here, we describe the specific challenges of identifying and quantifying mRNA terminal ends with LRS technologies and how these issues influence biological interpretations of LRS data. We then review recent experimental and computational advances designed to alleviate these problems, with ideal use cases for each approach. Finally, we outline anticipated developments and necessary improvements for the characterization of terminal ends from LRS data.
Challenges in identifying mRNA transcript starts and ends from long-read sequencing data. Genome Research 34(11):1719-1734. Published 2024-11-20. doi:10.1101/gr.279559.124. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610588/pdf/
RNA isoform diversity, produced via alternative splicing and alternative usage of transcription start and poly(A) sites, results in varied transcripts being derived from the same gene. Distinct isoforms can play important biological roles, including by changing the sequences or expression levels of protein products. The first single-cell approaches to RNA sequencing, and later spatial approaches, are now widely used to identify differentially expressed genes; they rely on short reads and can compare different cell types transcriptomically, but they are limited in their ability to measure differential isoform expression. More recently, long-read sequencing methods have been combined with single-cell and spatial technologies in order to characterize isoform expression. In this review, we provide an overview of the emergence of single-cell and spatial long-read sequencing and discuss the challenges associated with the implementation of these technologies and interpretation of these data. We discuss the opportunities they offer for understanding the relationships between the distinct variable elements of transcript molecules and highlight some of the ways in which they have been used to characterize isoforms' roles in development and pathology. Single-nucleus long-read sequencing, a special case of the single-cell approach, is also discussed. We attempt to cover both the limitations of these technologies and their significant potential for expanding our still-limited understanding of the biological roles of RNA isoforms.
Understanding isoform expression by pairing long-read sequencing with single-cell and spatial transcriptomics. Natan Belchikov, Justine Hsu, Xiang Jennie Li, Julien Jarroux, Wen Hu, Anoushka Joglekar, Hagen U Tilgner. Genome Research 34(11):1735-1746. Published 2024-11-20. doi:10.1101/gr.279640.124. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610585/pdf/
Anupama Jha, Stephanie C Bohaczuk, Yizi Mao, Jane Ranchalis, Benjamin J Mallory, Alan T Min, Morgan O Hamm, Elliott Swanson, Danilo Dubocanin, Connor Finkbeiner, Tony Li, Dale Whittington, William Stafford Noble, Andrew B Stergachis, Mitchell R Vollger
Long-read DNA sequencing has recently emerged as a powerful tool for studying both genetic and epigenetic architectures at single-molecule and single-nucleotide resolution. Long-read epigenetic studies encompass both the direct identification of native cytosine methylation and the identification of exogenously placed DNA N6-methyladenine (DNA-m6A). However, detecting DNA-m6A modifications using single-molecule sequencing, as well as coprocessing single-molecule genetic and epigenetic architectures, is limited by computational demands and a lack of supporting tools. Here, we introduce fibertools, a state-of-the-art toolkit that features a semisupervised convolutional neural network for fast and accurate identification of m6A-marked bases using Pacific Biosciences (PacBio) single-molecule long-read sequencing, as well as the coprocessing of long-read genetic and epigenetic data produced using either the PacBio or Oxford Nanopore Technologies (ONT) sequencing platforms. We demonstrate accurate DNA-m6A identification (>90% precision and recall) along >20 kb long DNA molecules with an ∼1000-fold improvement in speed. In addition, we demonstrate that fibertools can readily integrate genetic and epigenetic data at single-molecule resolution, including the seamless conversion between molecular and reference coordinate systems, allowing for accurate genetic and epigenetic analyses of long-read data within structurally and somatically variable genomic regions.
DNA-m6A calling and integrated long-read epigenetic and genetic analysis with fibertools. Genome Research, pp. 1976-1986. Published 2024-11-20. doi:10.1101/gr.279095.124. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610455/pdf/
Wouter De Coster, Ida Höijer, Inge Bruggeman, Svenn D'Hert, Malin Melin, Adam Ameur, Rosa Rademakers
The lack of population-scale databases hampers research and diagnostics for medically relevant tandem repeats and repeat expansions. We attempt to fill this gap using our pathSTR web tool, which leverages long-read sequencing of large cohorts to determine repeat length and sequence composition in a healthy population. The current version includes 1040 individuals of The 1000 Genomes Project cohort sequenced on the Oxford Nanopore Technologies PromethION. A comprehensive set of medically relevant tandem repeats has been genotyped using STRdust and LongTR to determine the tandem repeat length and sequence composition. PathSTR provides rich visualizations of this data set and the option to upload one's own data for comparison alongside the control cohort. We demonstrate the implementation of this application using data from targeted nanopore sequencing of a patient with myotonic dystrophy type 1. This resource will empower the genetics community to get a more complete overview of normal variation in tandem repeat length and sequence composition and, as such, enable a better assessment of rare tandem repeat alleles observed in patients.
Visualization and analysis of medically relevant tandem repeats in nanopore sequencing of control cohorts with pathSTR. Genome Research, pp. 2074-2080. Published 2024-11-20. doi:10.1101/gr.279265.124. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610575/pdf/