Xudong Liu, Ying Ni, Lianwei Ye, Zhihao Guo, Lu Tan, Jun Li, Mengsu Yang, Sheng Chen, Runsheng Li
DNA modifications in bacteria are diverse in type and distribution and play crucial functional roles. Current methods for detecting bacterial DNA modifications via nanopore sequencing typically compare raw current signals to a methylation-free control. In this study, we found that bacterial DNA modification induces errors in nanopore reads, and that these errors occur on only one strand, showing a strand-specific bias. Leveraging this discovery, we developed Hammerhead, a pioneering pipeline for de novo methylation discovery that requires neither raw signal inference nor a methylation-free control. The majority (14 out of 16) of the identified motifs can be validated by raw signal comparison methods or by identifying corresponding methyltransferases in bacteria. Additionally, we included a novel polishing strategy employing duplex reads to correct modification-induced errors in bacterial genome assemblies, achieving a reduction of over 85% in such errors. In summary, Hammerhead enables users to effectively locate bacterial DNA methylation sites from nanopore FASTQ/FASTA reads, and thus holds promise as a routine pipeline for a wide range of nanopore sequencing applications, such as genome assembly, metagenomic binning, decontaminating eukaryotic genome assemblies, and functional analysis of DNA modifications.
Nanopore strand-specific mismatch enables de novo detection of bacterial DNA modifications. Genome Research, published 2024-11-07. doi:10.1101/gr.279012.124
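The strand-specific bias described in the Hammerhead abstract can be illustrated with a small sketch. This is not Hammerhead's actual algorithm: the function names, the input layout (per-position match and mismatch counts split by strand), and the 0.3 threshold are all assumptions chosen for illustration.

```python
# Illustrative sketch only: flag positions whose mismatch rate differs
# strongly between forward- and reverse-strand reads. The names, input
# layout, and threshold are hypothetical, not taken from Hammerhead.

def strand_mismatch_rates(fwd_match, fwd_mismatch, rev_match, rev_mismatch):
    """Return (forward mismatch rate, reverse mismatch rate)."""
    fwd_total = fwd_match + fwd_mismatch
    rev_total = rev_match + rev_mismatch
    if fwd_total == 0 or rev_total == 0:
        raise ValueError("coverage is required on both strands")
    return fwd_mismatch / fwd_total, rev_mismatch / rev_total


def flag_candidate_sites(counts, min_diff=0.3):
    """Return sorted positions where the two strands disagree by >= min_diff.

    counts maps position -> (fwd_match, fwd_mismatch, rev_match, rev_mismatch).
    min_diff is an assumed cutoff on the absolute rate difference.
    """
    flagged = []
    for position, strand_counts in counts.items():
        fwd_rate, rev_rate = strand_mismatch_rates(*strand_counts)
        if abs(fwd_rate - rev_rate) >= min_diff:
            flagged.append(position)
    return sorted(flagged)


# Toy counts: position 250 has many forward-strand mismatches but an
# almost clean reverse strand, so only it is flagged.
example_counts = {
    100: (45, 5, 48, 2),
    250: (20, 30, 49, 1),
}
candidate_sites = flag_candidate_sites(example_counts)
```

A position with errors on both strands is more likely a true variant or a systematic basecalling problem, which is why the sketch compares the two strands rather than thresholding the overall error rate.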
Sergey Koren, Zhigui Bao, Andrea Guarracino, Shujun Ou, Sara Goodwin, Katharine M Jenike, Julian Lucas, Brandy McNulty, Jimin Park, Mikko Rautiainen, Arang Rhie, Dick Roelofs, Harrie Schneiders, Ilse Vrijenhoek, Koen Nijbroek, Olle Nordesjo, Sergey Nurk, Mike Vella, Katherine R Lawrence, Doreen Ware, Michael C Schatz, Erik Garrison, Sanwen Huang, William Richard McCombie, Karen H Miga, Alexander H J Wittenberg, Adam M Phillippy
The combination of ultra-long (UL) Oxford Nanopore Technologies (ONT) sequencing reads with long, accurate Pacific Biosciences (PacBio) High Fidelity (HiFi) reads has enabled the completion of a human genome and spurred similar efforts to complete the genomes of many other species. However, this approach for complete, "telomere-to-telomere" genome assembly relies on multiple sequencing platforms, limiting its accessibility. ONT "Duplex" sequencing reads, where both strands of the DNA are read to improve quality, promise high per-base accuracy. To evaluate this new data type, we generated ONT Duplex data for three widely studied genomes: human HG002, Solanum lycopersicum Heinz 1706 (tomato), and Zea mays B73 (maize). For the diploid, heterozygous HG002 genome, we also used "Pore-C" chromatin contact mapping to completely phase the haplotypes. We found the accuracy of Duplex data to be similar to that of HiFi sequencing, but with read lengths tens of kilobases longer, and the Pore-C data to be compatible with existing diploid assembly algorithms. This combination of read length and accuracy enables the construction of a high-quality initial assembly, which can then be further resolved using the UL reads, and finally phased into chromosome-scale haplotypes with Pore-C. The resulting assemblies have a base accuracy exceeding 99.999% (Q50) and near-perfect continuity, with most chromosomes assembled as single contigs. We conclude that ONT sequencing is a viable alternative to HiFi sequencing for de novo genome assembly, and provides a multirun single-instrument solution for the reconstruction of complete genomes.
Gapless assembly of complete human and plant chromosomes using only nanopore sequencing. Genome Research, published 2024-11-06. doi:10.1101/gr.279334.124
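The equivalence between "99.999%" and "Q50" in the abstract above follows from the standard Phred scale, Q = -10 log10(p), where p is the per-base error probability. A minimal sketch of the conversion, with function names chosen here for illustration:

```python
import math


def phred_from_error(error_probability):
    """Convert a per-base error probability to a Phred quality value."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(error_probability)


def error_from_phred(quality):
    """Convert a Phred quality value back to a per-base error probability."""
    return 10.0 ** (-quality / 10.0)


# An accuracy of 99.999% means one error per 100,000 bases (p = 1e-5),
# which corresponds to roughly Q50 on the Phred scale.
q_for_one_in_100k = phred_from_error(1e-5)
errors_per_megabase_at_q50 = error_from_phred(50) * 1_000_000
```

At Q50 the expected error count is about 10 per megabase, which gives a sense of scale for a gigabase-sized genome.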
Angela Gomez-Simmonds, Medini K Annavajhala, Dwayne Seeram, Todd W Hokunson, Heekuk Park, Anne-Catrin Uhlemann
Transmission of carbapenem-resistant Enterobacterales (CRE) in hospitals has been shown to occur through complex, multifarious networks driven by both clonal spread and horizontal transfer mediated by plasmids and other mobile genetic elements. We performed nanopore long-read sequencing on CRE isolates from a large urban hospital system to determine the overall contribution of plasmids to CRE transmission and identify specific plasmids implicated in the spread of blaKPC (the Klebsiella pneumoniae carbapenemase [KPC] gene). Six hundred and five CRE isolates collected between 2009 and 2018 first underwent Illumina sequencing for genome-wide genotyping; 435 blaKPC-positive isolates were then successfully nanopore sequenced to generate hybrid assemblies including circularized blaKPC-harboring plasmids. Phylogenetic analysis and Mash clustering were used to define putative clonal and plasmid transmission clusters, respectively. Overall, CRE isolates belonged to 96 multilocus sequence types (STs) encoding blaKPC on 447 plasmids which formed 54 plasmid clusters. We found evidence for clonal transmission in 66% of CRE isolates, over half of which belonged to four clades comprising K. pneumoniae ST258. Plasmid-mediated acquisition of blaKPC occurred in 23%-27% of isolates. While most plasmid clusters were small, several plasmids were identified in multiple different species and STs, including a highly promiscuous IncN plasmid and an IncF plasmid putatively spreading blaKPC from ST258 to other clones. Overall, this points to both the continued dominance of K. pneumoniae ST258 and the dissemination of blaKPC across clones and species by diverse plasmid backbones. These findings support integrating long-read sequencing into genomic surveillance approaches to detect the hitherto silent spread of carbapenem resistance driven by mobile plasmids.
Genomic epidemiology of carbapenem-resistant Enterobacterales at a New York City hospital over a 10-year period reveals complex plasmid-clone dynamics and evidence for frequent horizontal transfer of blaKPC. Genome Research, published 2024-11-05. doi:10.1101/gr.279355.124
Ilya B. Slizovskiy, Nathalie Bonin, Jonathan E. Bravo, Peter M. Ferm, Jacob Singer, Christina Boucher, Noelle R. Noyes
We investigated the efficiency of target-enriched long-read sequencing (TELSeq) for detecting antimicrobial resistance genes (ARGs) and mobile genetic elements (MGEs) within complex matrices. We aimed to overcome limitations associated with traditional antimicrobial resistance (AMR) detection methods, including short-read shotgun metagenomics, which can lack sensitivity, specificity, and the ability to provide detailed genomic context. By combining biotinylated probe-based enrichment with long-read sequencing, we facilitated the amplification and sequencing of ARGs, eliminating the need for bioinformatic reconstruction. Our experimental design included replicates of human fecal microbiota transplant material, bovine feces, pristine prairie soil, and a mock human gut microbial community, allowing us to examine variables including genomic DNA input and probe set composition. Our findings demonstrated that TELSeq markedly improves the detection rates of ARGs and MGEs compared to traditional sequencing methods, underlining its potential for accurate AMR monitoring. A key insight from our research is the importance of incorporating mobilome profiles to better predict the transferability of ARGs within microbial communities, prompting a recommendation for the use of combined ARG–MGE probe sets for future studies. We also reveal limitations for ARG detection from low-input workflows, and describe the next steps for ongoing protocol refinement to minimize technical variability and expand utility in clinical and public health settings. This effort is part of our broader commitment to advancing methodologies that address the global challenge of AMR.
Factors impacting target-enriched long-read sequencing of resistomes and mobilomes. Genome Research, published 2024-11-05. doi:10.1101/gr.279226.124
Alexander J Ritter, Jolene M Draper, Chris Vollmers, Jeremy R Sanford
Alternative splicing (AS) alters the cis-regulatory landscape of mRNA isoforms, leading to transcripts with distinct localization, stability, and translational efficiency. To rigorously investigate mRNA isoform-specific ribosome association, we generated subcellular fractionation and sequencing (Frac-seq) libraries using both conventional short reads and long reads from human embryonic stem cells (ESCs) and neural progenitor cells (NPCs) derived from the same ESCs. We performed de novo transcriptome assembly from high-confidence long reads from cytosolic, monosomal, light, and heavy polyribosomal fractions and quantified their abundance using short reads from their respective subcellular fractions. Thousands of transcripts in each cell type exhibited association with particular subcellular fractions relative to the cytosol. Of the multi-isoform genes, 27% and 19% exhibited significant differential isoform sedimentation in ESCs and NPCs, respectively. Alternative promoter usage and internal exon skipping accounted for the majority of differences between isoforms from the same gene. Random forest classifiers implicated coding sequence (CDS) and untranslated region (UTR) lengths as important determinants of isoform-specific sedimentation profiles, and motif analyses reveal potential cell type-specific and subcellular fraction-associated RNA-binding protein signatures. Taken together, our data demonstrate that alternative mRNA processing within the CDS and UTRs impacts the translational control of mRNA isoforms during stem cell differentiation, and highlight the utility of using a novel long-read sequencing-based method to study translational control.
Long-read subcellular fractionation and sequencing reveals the translational fate of full-length mRNA isoforms during neuronal differentiation. Genome Research, published 2024-11-05. doi:10.1101/gr.279170.124
Kristine Bilgrav Saether, Jesper Eisfeldt, Jesse D. Bengtsson, Ming Yin Lun, Christopher M. Grochowski, Medhat Mahmoud, Hsiao-Tuan Chao, Jill A. Rosenfeld, Pengfei Liu, Marlene Ek, Jakob Schuy, Adam Ameur, Hongzheng Dai, Undiagnosed Diseases Network, James Paul Hwang, Fritz J. Sedlazeck, Weimin Bi, Ronit Marom, Josephine Wincent, Ann Nordgren, Claudia M.B. Carvalho, Anna Lindstrand
Chromosomal inversions (INVs) are particularly challenging to detect due to their copy-number neutral state and association with repetitive regions. Inversions represent about 1/20 of all balanced structural chromosome aberrations and can lead to disease by disrupting genes or altering regulatory regions of dosage-sensitive genes in cis. Short-read genome sequencing (srGS) can only resolve ∼70% of cytogenetically visible inversions referred to clinical diagnostic laboratories, likely due to breakpoints in repetitive regions. Here, we study 12 inversions by long-read genome sequencing (lrGS) (n = 9) or srGS (n = 3) and resolve nine of them. In four cases, the inversion breakpoint region was missing from at least one of the human reference genomes (GRCh37, GRCh38, T2T-CHM13) and a reference-agnostic analysis was needed. One of these cases, an INV9 mappable only in de novo assembled lrGS data using T2T-CHM13, disrupts EHMT1, consistent with a Mendelian diagnosis (Kleefstra syndrome 1; MIM#610253). Next, by pairwise comparison between T2T-CHM13, GRCh37, and GRCh38, as well as the chimpanzee and bonobo, we show that hundreds of megabases of sequence are missing from at least one human reference, highlighting that primate genomes contribute to genomic diversity. Aligning population genomic data to these regions indicated that they vary between individuals. Our analysis emphasizes that T2T-CHM13 is necessary to maximize the value of lrGS for optimal inversion detection in clinical diagnostics. These results highlight the importance of leveraging diverse and comprehensive reference genomes to resolve unsolved molecular cases in rare diseases.
Leveraging the T2T assembly to resolve rare and pathogenic inversions in reference genome gaps. Genome Research, published 2024-11-01. doi:10.1101/gr.279346.124
Jonas A Gustafson, Sophia B Gibson, Nikhita Damaraju, Miranda P G Zalusky, Kendra Hoekzema, David Twesigomwe, Lei Yang, Anthony A Snead, Phillip A Richmond, Wouter De Coster, Nathan D Olson, Andrea Guarracino, Qiuhui Li, Angela L Miller, Joy Goffena, Zachary B Anderson, Sophie H R Storz, Sydney A Ward, Maisha Sinha, Claudia Gonzaga-Jauregui, Wayne E Clarke, Anna O Basile, André Corvelo, Catherine Reeves, Adrienne Helland, Rajeeva Lochan Musunuri, Mahler Revsine, Karynne E Patterson, Cate R Paschal, Christina Zakarian, Sara Goodwin, Tanner D Jensen, Esther Robb, W Richard McCombie, Fritz J Sedlazeck, Justin M Zook, Stephen B Montgomery, Erik Garrison, Mikhail Kolmogorov, Michael C Schatz, Richard N McLaughlin, Harriet Dashnow, Michael C Zody, Matt Loose, Miten Jain, Evan E Eichler, Danny E Miller
Fewer than half of individuals with a suspected Mendelian or monogenic condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control data sets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project (1KGP) Oxford Nanopore Technologies Sequencing Consortium aims to generate LRS data from at least 800 of the 1KGP samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37× and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
{"title":"High-coverage nanopore sequencing of samples from the 1000 Genomes Project to build a comprehensive catalog of human genetic variation.","authors":"Jonas A Gustafson, Sophia B Gibson, Nikhita Damaraju, Miranda P G Zalusky, Kendra Hoekzema, David Twesigomwe, Lei Yang, Anthony A Snead, Phillip A Richmond, Wouter De Coster, Nathan D Olson, Andrea Guarracino, Qiuhui Li, Angela L Miller, Joy Goffena, Zachary B Anderson, Sophie H R Storz, Sydney A Ward, Maisha Sinha, Claudia Gonzaga-Jauregui, Wayne E Clarke, Anna O Basile, André Corvelo, Catherine Reeves, Adrienne Helland, Rajeeva Lochan Musunuri, Mahler Revsine, Karynne E Patterson, Cate R Paschal, Christina Zakarian, Sara Goodwin, Tanner D Jensen, Esther Robb, W Richard McCombie, Fritz J Sedlazeck, Justin M Zook, Stephen B Montgomery, Erik Garrison, Mikhail Kolmogorov, Michael C Schatz, Richard N McLaughlin, Harriet Dashnow, Michael C Zody, Matt Loose, Miten Jain, Evan E Eichler, Danny E Miller","doi":"10.1101/gr.279273.124","DOIUrl":"10.1101/gr.279273.124","url":"10.1101/gr.279273.124","PeriodicalId":12678,"journal":{"name":"Genome research"},"IF":6.2,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142365031","RegionNum":2,"RegionCategory":"Biology","Total":0}
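The abstract above summarizes the data set by mean depth of coverage and read N50. As a minimal sketch of how those two statistics are conventionally computed from a list of read lengths (the function names `read_n50` and `mean_depth` are illustrative and not taken from the paper, and the depth estimate assumes every base of every read aligns to the genome):

```python
def read_n50(lengths):
    """Return the N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest read down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def mean_depth(lengths, genome_size):
    """Approximate mean depth as total sequenced bases / genome size.

    Assumption: all bases align; a real pipeline would count aligned bases.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(lengths) / genome_size


if __name__ == "__main__":
    toy_reads = [2, 2, 2, 3, 3, 4, 8, 8]  # toy lengths, not real data
    print(read_n50(toy_reads))
    print(mean_depth(toy_reads, genome_size=16))
```

N50 weights long reads more heavily than the mean or median read length does, which is why it is the usual summary for long-read runs.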
Wouter De Coster, Ida Höijer, Inge Bruggeman, Svenn D'Hert, Malin Melin, Adam Ameur, Rosa Rademakers
The lack of population-scale databases hampers research and diagnostics for medically relevant tandem repeats and repeat expansions. We attempt to fill this gap using our pathSTR web tool, which leverages long-read sequencing of large cohorts to determine repeat length and sequence composition in a healthy population. The current version includes 1040 individuals of the 1000 Genomes Project cohort sequenced on the Oxford Nanopore Technologies PromethION. A comprehensive set of medically relevant tandem repeats has been genotyped using STRdust and LongTR to determine the tandem repeat length and sequence composition. PathSTR provides rich visualizations of this data set and lets users upload their own data for comparison alongside the control cohort. We demonstrate the implementation of this application using data from targeted nanopore sequencing of a patient with myotonic dystrophy type 1. This resource will empower the genetics community to get a more complete overview of normal variation in tandem repeat length and sequence composition and, as such, enable a better assessment of rare tandem repeat alleles observed in patients.
{"title":"Visualization and analysis of medically relevant tandem repeats in nanopore sequencing of control cohorts with pathSTR.","authors":"Wouter De Coster, Ida Höijer, Inge Bruggeman, Svenn D'Hert, Malin Melin, Adam Ameur, Rosa Rademakers","doi":"10.1101/gr.279265.124","DOIUrl":"10.1101/gr.279265.124","url":"10.1101/gr.279265.124","PeriodicalId":12678,"journal":{"name":"Genome research"},"IF":6.2,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"141987779","RegionNum":2,"RegionCategory":"Biology","Total":0}
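The abstract above refers to two per-allele measurements: tandem repeat length and sequence composition. The sketch below illustrates both on a toy sequence. It is not the algorithm used by STRdust, LongTR, or pathSTR; the function names are hypothetical, and the CTG motif is used only because CTG expansions underlie myotonic dystrophy type 1.

```python
import re
from collections import Counter


def longest_repeat_run(sequence, motif):
    """Return (start, copies) for the longest uninterrupted run of `motif`.

    Returns (-1, 0) when the motif does not occur at all.
    """
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    best_start, best_copies = -1, 0
    for match in pattern.finditer(sequence):
        copies = (match.end() - match.start()) // len(motif)
        if copies > best_copies:
            best_start, best_copies = match.start(), copies
    return best_start, best_copies


def unit_composition(region, unit_len):
    """Split a repeat region into consecutive units of `unit_len` bases
    and count each distinct unit (trailing partial units are ignored)."""
    usable = len(region) - len(region) % unit_len
    units = (region[i:i + unit_len] for i in range(0, usable, unit_len))
    return dict(Counter(units))


if __name__ == "__main__":
    toy = "AA" + "CTG" * 3 + "TT"  # toy allele, not patient data
    print(longest_repeat_run(toy, "CTG"))
    print(unit_composition("CTGCTGCCGCTG", 3))
```

Counting units rather than only measuring length exposes interruptions (here a single CCG among CTG units), which is the kind of sequence-composition detail that a length-only genotype would hide.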
Lovro Vrček, Xavier Bresson, Thomas Laurent, Martin Schmitz, Kenji Kawaguchi, Mile Šikić
The critical stage of every de novo genome assembler is identifying paths in assembly graphs that correspond to the reconstructed genomic sequences. Existing algorithmic methods struggle with this, primarily because repetitive regions cause complex graph tangles that lead to fragmented assemblies. Here, we introduce GNNome, a framework for path identification based on geometric deep learning that enables training models on assembly graphs without relying on existing assembly strategies. By leveraging only the symmetries inherent to the problem, GNNome reconstructs assemblies from PacBio HiFi reads with contiguity and quality comparable to those of the state-of-the-art tools across several species. With every new genome assembled telomere-to-telomere, the amount of reliable training data at our disposal increases. Combining the straightforward generation of abundant simulated data for diverse genomic structures with the AI approach makes the proposed framework a plausible cornerstone for future work on reconstructing complex genomes with different degrees of ploidy and aneuploidy. To facilitate such developments, we make the framework and the best-performing model publicly available, provided as a tool that can be used directly to assemble new haploid genomes.
{"title":"Geometric deep learning framework for de novo genome assembly","authors":"Lovro Vrček, Xavier Bresson, Thomas Laurent, Martin Schmitz, Kenji Kawaguchi, Mile Šikić","doi":"10.1101/gr.279307.124","DOIUrl":"https://doi.org/10.1101/gr.279307.124","url":"https://doi.org/10.1101/gr.279307.124","PeriodicalId":12678,"journal":{"name":"Genome research"},"IF":7.0,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142541287","RegionNum":2,"RegionCategory":"Biology","Total":0}
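The abstract above frames assembly as identifying paths in an assembly graph. The toy sketch below shows what path identification can look like once every edge carries a score. It is an assumption-laden illustration, not GNNome's method: the abstract does not describe the decoding procedure, the edge scores here are hand-written rather than predicted by any model, and the greedy walk and the name `greedy_path` are hypothetical.

```python
def greedy_path(graph, start):
    """Walk from `start`, always taking the highest-scoring edge to a node
    that has not been visited yet, and stop when no such edge exists.

    `graph` maps node -> {successor: score}. Scores are assumed to be given.
    """
    path = [start]
    visited = {start}
    node = start
    while True:
        candidates = [
            (score, nxt)
            for nxt, score in graph.get(node, {}).items()
            if nxt not in visited
        ]
        if not candidates:
            return path
        # Highest score wins; ties fall back to the node name.
        _, node = max(candidates)
        path.append(node)
        visited.add(node)


if __name__ == "__main__":
    # Toy overlap graph between four reads; scores are made up.
    toy_graph = {
        "r1": {"r2": 0.9, "r3": 0.2},
        "r2": {"r3": 0.8, "r4": 0.1},
        "r3": {"r4": 0.7, "r2": 0.6},
        "r4": {},
    }
    print(greedy_path(toy_graph, "r1"))
```

A greedy walk is the simplest possible decoder and can be misled by a single bad score inside a repeat tangle, which is the failure mode the abstract attributes to repetitive regions.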
Jesper Eisfeldt, Edward J. Higginbotham, Felix Lenner, Jennifer Howe, Bridget A. Fernandez, Anna Lindstrand, Stephen W. Scherer, Lars Feuk
Rare or de novo structural variation, primarily in the form of copy number variants, is detected in 5%–10% of autism spectrum disorder (ASD) families. While complex structural variants involving duplications can generally be detected using microarray or short-read genome sequencing (GS), these methods frequently fail to characterize breakpoints at nucleotide resolution, requiring additional molecular methods for validation and fine-mapping. Here, we use Oxford Nanopore Technologies PromethION long-read GS to characterize complex genomic rearrangements (CGRs) involving large duplications that segregate with ASD in five families. In total, we investigated 13 CGR carriers and were able to resolve all breakpoint junctions at nucleotide resolution. While all breakpoints were identified, the precise genomic architecture of one rearrangement remained unresolved with three different potential structures. The findings in two families include potential fusion genes formed through duplication rearrangements, involving IL1RAPL1–DMD and SUPT16H–CHD8. In two of the families originating from the same geographical region, an identical rearrangement involving ANK2 was identified, which likely represents a founder variant. In addition, we analyze methylation status directly from the long-read data, allowing us to assess the activity of rearranged genes and regulatory regions. Investigation of methylation across the CGRs reveals aberrant methylation status in carriers across a rearrangement affecting the CREBBP locus. In aggregate, our results demonstrate the utility of nanopore sequencing to pinpoint CGRs associated with ASD in five unrelated families, and highlight the importance of a gene-centric description of disease-associated complex chromosomal rearrangements.
{"title":"Resolving complex duplication variants in autism spectrum disorder using long-read genome sequencing","authors":"Jesper Eisfeldt, Edward J. Higginbotham, Felix Lenner, Jennifer Howe, Bridget A. Fernandez, Anna Lindstrand, Stephen W. Scherer, Lars Feuk","doi":"10.1101/gr.279263.124","DOIUrl":"https://doi.org/10.1101/gr.279263.124","url":"https://doi.org/10.1101/gr.279263.124","PeriodicalId":12678,"journal":{"name":"Genome research"},"IF":7.0,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142541288","RegionNum":2,"RegionCategory":"Biology","Total":0}