首页 > 最新文献

Genome research最新文献

英文 中文
Nanopore strand-specific mismatch enables de novo detection of bacterial DNA modifications. 纳米孔链条特异性错配可实现对细菌 DNA 修饰的从头检测。
IF 6.2 2区 生物学 Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date : 2024-11-07 DOI: 10.1101/gr.279012.124
Xudong Liu, Ying Ni, Lianwei Ye, Zhihao Guo, Lu Tan, Jun Li, Mengsu Yang, Sheng Chen, Runsheng Li

DNA modifications in bacteria present diverse types and distributions, playing crucial functional roles. Current methods for detecting bacterial DNA modifications via nanopore sequencing typically involve comparing raw current signals to a methylation-free control. In this study, we found that bacterial DNA modification induces errors in nanopore reads. And these errors are found only in one strand but not the other, showing a strand-specific bias. Leveraging this discovery, we developed Hammerhead, a pioneering pipeline designed for de novo methylation discovery that circumvents the necessity of raw signal inference and a methylation-free control. The majority (14 out of 16) of the identified motifs can be validated by raw signal comparison methods or by identifying corresponding methyltransferases in bacteria. Additionally, we included a novel polishing strategy employing duplex reads to correct modification-induced errors in bacterial genome assemblies, achieving a reduction of over 85% in such errors. In summary, Hammerhead enables users to effectively locate bacterial DNA methylation sites from nanopore FASTQ/FASTA reads, thus holds promise as a routine pipeline for a wide range of nanopore sequencing applications, such as genome assembly, metagenomic binning, decontaminating eukaryotic genome assemblies, and functional analysis for DNA modifications.

细菌中的 DNA 修饰具有多种类型和分布,发挥着重要的功能作用。目前通过纳米孔测序检测细菌 DNA 修饰的方法通常是将原始电流信号与无甲基化对照进行比较。在这项研究中,我们发现细菌 DNA 修饰会导致纳米孔读数出现错误。而且这些误差只出现在一条链上,而不是另一条链上,这显示了链特异性偏差。利用这一发现,我们开发了 Hammerhead,这是一种用于从头甲基化发现的开创性流水线,它避免了原始信号推断和无甲基化对照的必要性。大部分(16 个中的 14 个)鉴定出的主题可以通过原始信号比较方法或鉴定细菌中相应的甲基转移酶来验证。此外,我们还采用了一种新颖的抛光策略,利用双链读数纠正细菌基因组组装中由修饰引起的错误,减少了 85% 以上的此类错误。总之,Hammerhead 能让用户从纳米孔 FASTQ/FASTA 读数中有效定位细菌 DNA 甲基化位点,因此有望成为基因组组装、元基因组分选、去污真核基因组组装和 DNA 修饰功能分析等各种纳米孔测序应用的常规管道。
{"title":"Nanopore strand-specific mismatch enables de novo detection of bacterial DNA modifications.","authors":"Xudong Liu, Ying Ni, Lianwei Ye, Zhihao Guo, Lu Tan, Jun Li, Mengsu Yang, Sheng Chen, Runsheng Li","doi":"10.1101/gr.279012.124","DOIUrl":"10.1101/gr.279012.124","url":null,"abstract":"<p><p>DNA modifications in bacteria present diverse types and distributions, playing crucial functional roles. Current methods for detecting bacterial DNA modifications via nanopore sequencing typically involve comparing raw current signals to a methylation-free control. In this study, we found that bacterial DNA modification induces errors in nanopore reads. And these errors are found only in one strand but not the other, showing a strand-specific bias. Leveraging this discovery, we developed Hammerhead, a pioneering pipeline designed for de novo methylation discovery that circumvents the necessity of raw signal inference and a methylation-free control. The majority (14 out of 16) of the identified motifs can be validated by raw signal comparison methods or by identifying corresponding methyltransferases in bacteria. Additionally, we included a novel polishing strategy employing duplex reads to correct modification-induced errors in bacterial genome assemblies, achieving a reduction of over 85% in such errors. In summary, Hammerhead enables users to effectively locate bacterial DNA methylation sites from nanopore FASTQ/FASTA reads, thus holds promise as a routine pipeline for a wide range of nanopore sequencing applications, such as genome assembly, metagenomic binning, decontaminating eukaryotic genome assemblies, and functional analysis for DNA modifications.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":6.2,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142365032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Gapless assembly of complete human and plant chromosomes using only nanopore sequencing. 仅使用纳米孔测序技术无间隙组装完整的人类和植物染色体。
IF 6.2 2区 生物学 Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date : 2024-11-06 DOI: 10.1101/gr.279334.124
Sergey Koren, Zhigui Bao, Andrea Guarracino, Shujun Ou, Sara Goodwin, Katharine M Jenike, Julian Lucas, Brandy McNulty, Jimin Park, Mikko Rautiainen, Arang Rhie, Dick Roelofs, Harrie Schneiders, Ilse Vrijenhoek, Koen Nijbroek, Olle Nordesjo, Sergey Nurk, Mike Vella, Katherine R Lawrence, Doreen Ware, Michael C Schatz, Erik Garrison, Sanwen Huang, William Richard McCombie, Karen H Miga, Alexander H J Wittenberg, Adam M Phillippy

The combination of ultra-long (UL) Oxford Nanopore Technologies (ONT) sequencing reads with long, accurate Pacific Bioscience (PacBio) High Fidelity (HiFi) reads has enabled the completion of a human genome and spurred similar efforts to complete the genomes of many other species. However, this approach for complete, "telomere-to-telomere" genome assembly relies on multiple sequencing platforms, limiting its accessibility. ONT "Duplex" sequencing reads, where both strands of the DNA are read to improve quality, promise high per-base accuracy. To evaluate this new data type, we generated ONT Duplex data for three widely studied genomes: human HG002, Solanum lycopersicum Heinz 1706 (tomato), and Zea mays B73 (maize). For the diploid, heterozygous HG002 genome, we also used "Pore-C" chromatin contact mapping to completely phase the haplotypes. We found the accuracy of Duplex data to be similar to HiFi sequencing, but with read lengths tens of kilobases longer, and the Pore-C data to be compatible with existing diploid assembly algorithms. This combination of read length and accuracy enables the construction of a high-quality initial assembly, which can then be further resolved using the UL reads, and finally phased into chromosome-scale haplotypes with Pore-C. The resulting assemblies have a base accuracy exceeding 99.999% (Q50) and near-perfect continuity, with most chromosomes assembled as single contigs. We conclude that ONT sequencing is a viable alternative to HiFi sequencing for de novo genome assembly, and provides a multirun single-instrument solution for the reconstruction of complete genomes.

牛津纳米孔技术公司(ONT)的超长(UL)测序读数与太平洋生物科学公司(PacBio)的长而精确的高保真(HiFi)读数相结合,完成了人类基因组,并推动了完成许多其他物种基因组的类似工作。然而,这种 "端粒到端粒 "的完整基因组组装方法依赖于多个测序平台,限制了其可及性。ONT "双链 "测序读数可同时读取DNA的两条链以提高质量,并保证每个碱基的高准确性。为了评估这一新的数据类型,我们为三个被广泛研究的基因组生成了 ONT 双重数据:人类 HG002、Solanum lycopersicum Heinz 1706(番茄)和 Zea mays B73(玉米)。对于二倍体、杂合子 HG002 基因组,我们还使用了 "Pore-C "染色质接触图谱来完全分期单倍型。我们发现,Duplex 数据的准确性与 HiFi 测序相似,但读数长度长几十个千碱基,而 Pore-C 数据与现有的二倍体组装算法兼容。读数长度和准确性的结合使我们能够构建高质量的初始组装,然后利用 UL 读数进一步解析,最后利用 Pore-C 分阶段形成染色体规模的单倍型。最终的组装结果具有超过 99.999% 的碱基准确率(Q50)和近乎完美的连续性,大多数染色体组装为单个等位基因。我们的结论是,ONT 测序是全新基因组组装中 HiFi 测序的可行替代方案,并为重建完整基因组提供了多轮单仪器解决方案。
{"title":"Gapless assembly of complete human and plant chromosomes using only nanopore sequencing.","authors":"Sergey Koren, Zhigui Bao, Andrea Guarracino, Shujun Ou, Sara Goodwin, Katharine M Jenike, Julian Lucas, Brandy McNulty, Jimin Park, Mikko Rautiainen, Arang Rhie, Dick Roelofs, Harrie Schneiders, Ilse Vrijenhoek, Koen Nijbroek, Olle Nordesjo, Sergey Nurk, Mike Vella, Katherine R Lawrence, Doreen Ware, Michael C Schatz, Erik Garrison, Sanwen Huang, William Richard McCombie, Karen H Miga, Alexander H J Wittenberg, Adam M Phillippy","doi":"10.1101/gr.279334.124","DOIUrl":"https://doi.org/10.1101/gr.279334.124","url":null,"abstract":"<p><p>The combination of ultra-long (UL) Oxford Nanopore Technologies (ONT) sequencing reads with long, accurate Pacific Bioscience (PacBio) High Fidelity (HiFi) reads has enabled the completion of a human genome and spurred similar efforts to complete the genomes of many other species. However, this approach for complete, \"telomere-to-telomere\" genome assembly relies on multiple sequencing platforms, limiting its accessibility. ONT \"Duplex\" sequencing reads, where both strands of the DNA are read to improve quality, promise high per-base accuracy. To evaluate this new data type, we generated ONT Duplex data for three widely studied genomes: human HG002, <i>Solanum lycopersicum</i> Heinz 1706 (tomato), and <i>Zea mays</i> B73 (maize). For the diploid, heterozygous HG002 genome, we also used \"Pore-C\" chromatin contact mapping to completely phase the haplotypes. We found the accuracy of Duplex data to be similar to HiFi sequencing, but with read lengths tens of kilobases longer, and the Pore-C data to be compatible with existing diploid assembly algorithms. This combination of read length and accuracy enables the construction of a high-quality initial assembly, which can then be further resolved using the UL reads, and finally phased into chromosome-scale haplotypes with Pore-C. The resulting assemblies have a base accuracy exceeding 99.999% (Q50) and near-perfect continuity, with most chromosomes assembled as single contigs. We conclude that ONT sequencing is a viable alternative to HiFi sequencing for de novo genome assembly, and provides a multirun single-instrument solution for the reconstruction of complete genomes.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":6.2,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142589915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Genomic epidemiology of carbapenem-resistant Enterobacterales at a New York City hospital over a 10-year period reveals complex plasmid-clone dynamics and evidence for frequent horizontal transfer of bla KPC. 纽约市一家医院十年间耐碳青霉烯类肠杆菌的基因组流行病学揭示了复杂的质粒克隆动态和 bla KPC 频繁水平转移的证据。
IF 6.2 2区 生物学 Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date : 2024-11-05 DOI: 10.1101/gr.279355.124
Angela Gomez-Simmonds, Medini K Annavajhala, Dwayne Seeram, Todd W Hokunson, Heekuk Park, Anne-Catrin Uhlemann

Transmission of carbapenem-resistant Enterobacterales (CRE) in hospitals has been shown to occur through complex, multifarious networks driven by both clonal spread and horizontal transfer mediated by plasmids and other mobile genetic elements. We performed nanopore long-read sequencing on CRE isolates from a large urban hospital system to determine the overall contribution of plasmids to CRE transmission and identify specific plasmids implicated in the spread of bla KPC (the Klebsiella pneumoniae carbapenemase [KPC] gene). Six hundred and five CRE isolates collected between 2009 and 2018 first underwent Illumina sequencing for genome-wide genotyping; 435 bla KPC-positive isolates were then successfully nanopore sequenced to generate hybrid assemblies including circularized bla KPC-harboring plasmids. Phylogenetic analysis and Mash clustering were used to define putative clonal and plasmid transmission clusters, respectively. Overall, CRE isolates belonged to 96 multilocus sequence types (STs) encoding bla KPC on 447 plasmids which formed 54 plasmid clusters. We found evidence for clonal transmission in 66% of CRE isolates, over half of which belonged to four clades comprising K. pneumoniae ST258. Plasmid-mediated acquisition of bla KPC occurred in 23%-27% of isolates. While most plasmid clusters were small, several plasmids were identified in multiple different species and STs, including a highly promiscuous IncN plasmid and an IncF plasmid putatively spreading bla KPC from ST258 to other clones. Overall, this points to both the continued dominance of K. pneumoniae ST258 and the dissemination of bla KPC across clones and species by diverse plasmid backbones. These findings support integrating long-read sequencing into genomic surveillance approaches to detect the hitherto silent spread of carbapenem resistance driven by mobile plasmids.

耐碳青霉烯类肠杆菌(CRE)在医院中的传播已被证明是通过由质粒和其他移动遗传因子介导的克隆传播和水平转移所驱动的复杂而多样的网络进行的。我们对来自一个大型城市医院系统的 CRE 分离物进行了纳米孔长读数测序,以确定质粒对 CRE 传播的总体贡献,并识别与 bla KPC(肺炎克雷伯菌碳青霉烯酶 [KPC] 基因)传播有关的特定质粒。2009-2018 年间收集的 605 株 CRE 分离物首先进行了 Illumina 测序,以进行全基因组基因分型;然后对 435 株 bla KPC 阳性分离物进行了成功的纳米孔测序,以生成包括环化 bla KPC 携带质粒的杂交组合。系统发育分析和 Mash 聚类分别用于确定假定的克隆和质粒传播群。总体而言,CRE 分离物属于 96 个多焦点序列类型(ST),在 447 个质粒上编码 bla KPC,形成 54 个质粒群。我们在 66% 的 CRE 分离物中发现了克隆传播的证据,其中一半以上属于由肺炎克菌 ST258 组成的四个支系。23-27%的分离株通过质粒获得了 bla KPC。虽然大多数质粒群规模较小,但在多个不同物种和 ST 中发现了几种质粒,包括一种高度杂合的 IncN 质粒和一种可能将 bla KPC 从 ST258 传播到其他克隆的 IncF 质粒。总之,这表明肺炎克菌 ST258 仍处于优势地位,而 bla KPC 则通过不同的质粒骨架在克隆和物种间传播。这些发现支持将长读测序纳入基因组监测方法,以检测迄今为止由移动质粒驱动的碳青霉烯耐药性的无声传播。
{"title":"Genomic epidemiology of carbapenem-resistant Enterobacterales at a New York City hospital over a 10-year period reveals complex plasmid-clone dynamics and evidence for frequent horizontal transfer of <i>bla</i> <sub>KPC</sub>.","authors":"Angela Gomez-Simmonds, Medini K Annavajhala, Dwayne Seeram, Todd W Hokunson, Heekuk Park, Anne-Catrin Uhlemann","doi":"10.1101/gr.279355.124","DOIUrl":"10.1101/gr.279355.124","url":null,"abstract":"<p><p>Transmission of carbapenem-resistant Enterobacterales (CRE) in hospitals has been shown to occur through complex, multifarious networks driven by both clonal spread and horizontal transfer mediated by plasmids and other mobile genetic elements. We performed nanopore long-read sequencing on CRE isolates from a large urban hospital system to determine the overall contribution of plasmids to CRE transmission and identify specific plasmids implicated in the spread of <i>bla</i> <sub>KPC</sub> (the <i>Klebsiella pneumoniae</i> carbapenemase [KPC] gene). Six hundred and five CRE isolates collected between 2009 and 2018 first underwent Illumina sequencing for genome-wide genotyping; 435 <i>bla</i> <sub>KPC</sub>-positive isolates were then successfully nanopore sequenced to generate hybrid assemblies including circularized <i>bla</i> <sub>KPC</sub>-harboring plasmids. Phylogenetic analysis and Mash clustering were used to define putative clonal and plasmid transmission clusters, respectively. Overall, CRE isolates belonged to 96 multilocus sequence types (STs) encoding <i>bla</i> <sub>KPC</sub> on 447 plasmids which formed 54 plasmid clusters. We found evidence for clonal transmission in 66% of CRE isolates, over half of which belonged to four clades comprising <i>K. pneumoniae</i> ST258. Plasmid-mediated acquisition of <i>bla</i> <sub>KPC</sub> occurred in 23%-27% of isolates. While most plasmid clusters were small, several plasmids were identified in multiple different species and STs, including a highly promiscuous IncN plasmid and an IncF plasmid putatively spreading <i>bla</i> <sub>KPC</sub> from ST258 to other clones. Overall, this points to both the continued dominance of <i>K. pneumoniae</i> ST258 and the dissemination of <i>bla</i> <sub>KPC</sub> across clones and species by diverse plasmid backbones. These findings support integrating long-read sequencing into genomic surveillance approaches to detect the hitherto silent spread of carbapenem resistance driven by mobile plasmids.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":6.2,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142375382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Factors impacting target-enriched long-read sequencing of resistomes and mobilomes 影响抗性基因组和动员基因组目标富集长读程测序的因素
IF 7 2区 生物学 Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date : 2024-11-05 DOI: 10.1101/gr.279226.124
Ilya B. Slizovskiy, Nathalie Bonin, Jonathan E. Bravo, Peter M. Ferm, Jacob Singer, Christina Boucher, Noelle R. Noyes
We investigated the efficiency of target-enriched long-read sequencing (TELSeq) for detecting antimicrobial resistance genes (ARGs) and mobile genetic elements (MGEs) within complex matrices. We aimed to overcome limitations associated with traditional antimicrobial resistance (AMR) detection methods, including short-read shotgun metagenomics, which can lack sensitivity, specificity, and the ability to provide detailed genomic context. By combining biotinylated probe-based enrichment with long-read sequencing, we facilitated the amplification and sequencing of ARGs, eliminating the need for bioinformatic reconstruction. Our experimental design included replicates of human fecal microbiota transplant material, bovine feces, pristine prairie soil, and a mock human gut microbial community, allowing us to examine variables including genomic DNA input and probe set composition. Our findings demonstrated that TELSeq markedly improves the detection rates of ARGs and MGEs compared to traditional sequencing methods, underlining its potential for accurate AMR monitoring. A key insight from our research is the importance of incorporating mobilome profiles to better predict the transferability of ARGs within microbial communities, prompting a recommendation for the use of combined ARG–MGE probe sets for future studies. We also reveal limitations for ARG detection from low-input workflows, and describe the next steps for ongoing protocol refinement to minimize technical variability and expand utility in clinical and public health settings. This effort is part of our broader commitment to advancing methodologies that address the global challenge of AMR.
我们研究了目标富集长读数测序(TELSeq)在复杂基质中检测抗菌素耐药性基因(ARGs)和移动遗传因子(MGEs)的效率。我们的目标是克服传统抗菌药耐药性(AMR)检测方法的局限性,包括缺乏灵敏度、特异性和提供详细基因组背景信息能力的短读数猎枪元基因组学。通过将基于生物素化探针的富集与长线程测序相结合,我们促进了 ARGs 的扩增和测序,从而消除了生物信息重建的需要。我们的实验设计包括人类粪便微生物群移植材料、牛粪便、原始草原土壤和模拟人类肠道微生物群落的重复实验,使我们能够研究包括基因组 DNA 输入和探针集组成在内的变量。我们的研究结果表明,与传统测序方法相比,TELSeq 显著提高了 ARGs 和 MGEs 的检出率,凸显了其在准确监测 AMR 方面的潜力。我们的研究得出的一个重要结论是,必须结合移动组图谱来更好地预测 ARGs 在微生物群落中的可转移性,因此建议在未来的研究中使用 ARG-MGE 组合探针集。我们还揭示了低投入工作流程在检测 ARG 方面的局限性,并介绍了下一步如何不断完善方案,以最大限度地减少技术变异,扩大在临床和公共卫生环境中的应用。这项工作是我们更广泛承诺的一部分,我们致力于推进各种方法,以应对 AMR 这一全球性挑战。
{"title":"Factors impacting target-enriched long-read sequencing of resistomes and mobilomes","authors":"Ilya B. Slizovskiy, Nathalie Bonin, Jonathan E. Bravo, Peter M. Ferm, Jacob Singer, Christina Boucher, Noelle R. Noyes","doi":"10.1101/gr.279226.124","DOIUrl":"https://doi.org/10.1101/gr.279226.124","url":null,"abstract":"We investigated the efficiency of target-enriched long-read sequencing (TELSeq) for detecting antimicrobial resistance genes (ARGs) and mobile genetic elements (MGEs) within complex matrices. We aimed to overcome limitations associated with traditional antimicrobial resistance (AMR) detection methods, including short-read shotgun metagenomics, which can lack sensitivity, specificity, and the ability to provide detailed genomic context. By combining biotinylated probe-based enrichment with long-read sequencing, we facilitated the amplification and sequencing of ARGs, eliminating the need for bioinformatic reconstruction. Our experimental design included replicates of human fecal microbiota transplant material, bovine feces, pristine prairie soil, and a mock human gut microbial community, allowing us to examine variables including genomic DNA input and probe set composition. Our findings demonstrated that TELSeq markedly improves the detection rates of ARGs and MGEs compared to traditional sequencing methods, underlining its potential for accurate AMR monitoring. A key insight from our research is the importance of incorporating mobilome profiles to better predict the transferability of ARGs within microbial communities, prompting a recommendation for the use of combined ARG–MGE probe sets for future studies. We also reveal limitations for ARG detection from low-input workflows, and describe the next steps for ongoing protocol refinement to minimize technical variability and expand utility in clinical and public health settings. This effort is part of our broader commitment to advancing methodologies that address the global challenge of AMR.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":7.0,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142580570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Long-read subcellular fractionation and sequencing reveals the translational fate of full-length mRNA isoforms during neuronal differentiation. 长读数亚细胞分馏和测序揭示了全长 mRNA 同工型在神经元分化过程中的翻译命运。
IF 6.2 2区 生物学 Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date : 2024-11-05 DOI: 10.1101/gr.279170.124
Alexander J Ritter, Jolene M Draper, Chris Vollmers, Jeremy R Sanford

Alternative splicing (AS) alters the cis-regulatory landscape of mRNA isoforms, leading to transcripts with distinct localization, stability, and translational efficiency. To rigorously investigate mRNA isoform-specific ribosome association, we generated subcellular fractionation and sequencing (Frac-seq) libraries using both conventional short reads and long reads from human embryonic stem cells (ESCs) and neural progenitor cells (NPCs) derived from the same ESCs. We performed de novo transcriptome assembly from high-confidence long reads from cytosolic, monosomal, light, and heavy polyribosomal fractions and quantified their abundance using short reads from their respective subcellular fractions. Thousands of transcripts in each cell type exhibited association with particular subcellular fractions relative to the cytosol. Of the multi-isoform genes, 27% and 19% exhibited significant differential isoform sedimentation in ESCs and NPCs, respectively. Alternative promoter usage and internal exon skipping accounted for the majority of differences between isoforms from the same gene. Random forest classifiers implicated coding sequence (CDS) and untranslated region (UTR) lengths as important determinants of isoform-specific sedimentation profiles, and motif analyses reveal potential cell type-specific and subcellular fraction-associated RNA-binding protein signatures. Taken together, our data demonstrate that alternative mRNA processing within the CDS and UTRs impacts the translational control of mRNA isoforms during stem cell differentiation, and highlight the utility of using a novel long-read sequencing-based method to study translational control.

替代剪接(AS)改变了 mRNA 同工型的顺式调控结构,导致转录本具有不同的定位、稳定性和翻译效率。为了严格研究mRNA异构体特异性核糖体关联,我们使用传统的短读数和长读数生成了亚细胞分馏和测序(Frac-seq)文库,这些文库来自人类胚胎干细胞(ESC)和来自同一ESC的神经祖细胞(NPC)。我们利用来自细胞质、单体、轻型和重型多核糖体组分的高置信度长读数进行了从头转录组组装,并利用来自各自亚细胞组分的短读数量化了它们的丰度。与细胞质相比,每种细胞类型中都有数千个转录本与特定亚细胞组分相关。在多同工酶基因中,分别有 27% 和 19% 的基因在 ESC 和 NPC 中表现出明显的同工酶沉积差异。启动子的交替使用和内部外显子的跳转是造成同一基因不同异构体之间差异的主要原因。随机森林分类器表明,编码序列(CDS)和UTR长度是决定同工酶特异性沉降谱的重要因素,而主题分析揭示了潜在的细胞类型特异性和亚细胞组分相关的RNA结合蛋白特征。总之,我们的数据证明了在干细胞分化过程中,CDS和UTR内的mRNA替代加工影响了mRNA异构体的翻译控制,并突出了使用基于长读数测序的新型方法研究翻译控制的实用性。
{"title":"Long-read subcellular fractionation and sequencing reveals the translational fate of full-length mRNA isoforms during neuronal differentiation.","authors":"Alexander J Ritter, Jolene M Draper, Chris Vollmers, Jeremy R Sanford","doi":"10.1101/gr.279170.124","DOIUrl":"10.1101/gr.279170.124","url":null,"abstract":"<p><p>Alternative splicing (AS) alters the <i>cis</i>-regulatory landscape of mRNA isoforms, leading to transcripts with distinct localization, stability, and translational efficiency. To rigorously investigate mRNA isoform-specific ribosome association, we generated subcellular fractionation and sequencing (Frac-seq) libraries using both conventional short reads and long reads from human embryonic stem cells (ESCs) and neural progenitor cells (NPCs) derived from the same ESCs. We performed de novo transcriptome assembly from high-confidence long reads from cytosolic, monosomal, light, and heavy polyribosomal fractions and quantified their abundance using short reads from their respective subcellular fractions. Thousands of transcripts in each cell type exhibited association with particular subcellular fractions relative to the cytosol. Of the multi-isoform genes, 27% and 19% exhibited significant differential isoform sedimentation in ESCs and NPCs, respectively. Alternative promoter usage and internal exon skipping accounted for the majority of differences between isoforms from the same gene. Random forest classifiers implicated coding sequence (CDS) and untranslated region (UTR) lengths as important determinants of isoform-specific sedimentation profiles, and motif analyses reveal potential cell type-specific and subcellular fraction-associated RNA-binding protein signatures. Taken together, our data demonstrate that alternative mRNA processing within the CDS and UTRs impacts the translational control of mRNA isoforms during stem cell differentiation, and highlight the utility of using a novel long-read sequencing-based method to study translational control.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":6.2,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141261622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Leveraging the T2T assembly to resolve rare and pathogenic inversions in reference genome gaps 利用 T2T 组装解决参考基因组间隙中的罕见和致病倒位问题
IF 7 2区 生物学 Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date : 2024-11-01 DOI: 10.1101/gr.279346.124
Kristine Bilgrav Saether, Jesper Eisfeldt, Jesse D. Bengtsson, Ming Yin Lun, Christopher M. Grochowski, Medhat Mahmoud, Hsiao-Tuan Chao, Jill A. Rosenfeld, Pengfei Liu, Marlene Ek, Jakob Schuy, Adam Ameur, Hongzheng Dai, Undiagnosed Diseases Network, James Paul Hwang, Fritz J. Sedlazeck, Weimin Bi, Ronit Marom, Josephine Wincent, Ann Nordgren, Claudia M.B. Carvalho, Anna Lindstrand
Chromosomal inversions (INVs) are particularly challenging to detect due to their copy-number neutral state and association with repetitive regions. Inversions represent about 1/20 of all balanced structural chromosome aberrations and can lead to disease by gene disruption or altering regulatory regions of dosage-sensitive genes in cis. Short-read genome sequencing (srGS) can only resolve ∼70% of cytogenetically visible inversions referred to clinical diagnostic laboratories, likely due to breakpoints in repetitive regions. Here, we study 12 inversions by long-read genome sequencing (lrGS) (n = 9) or srGS (n = 3) and resolve nine of them. In four cases, the inversion breakpoint region was missing from at least one of the human reference genomes (GRCh37, GRCh38, T2T-CHM13) and a reference agnostic analysis was needed. One of these cases, an INV9 mappable only in de novo assembled lrGS data using T2T-CHM13 disrupts EHMT1 consistent with a Mendelian diagnosis (Kleefstra syndrome 1; MIM#610253). Next, by pairwise comparison between T2T-CHM13, GRCh37, and GRCh38, as well as the chimpanzee and bonobo, we show that hundreds of megabases of sequence are missing from at least one human reference, highlighting that primate genomes contribute to genomic diversity. Aligning population genomic data to these regions indicated that these regions are variable between individuals. Our analysis emphasizes that T2T-CHM13 is necessary to maximize the value of lrGS for optimal inversion detection in clinical diagnostics. These results highlight the importance of leveraging diverse and comprehensive reference genomes to resolve unsolved molecular cases in rare diseases.
染色体倒位(INVs)由于其拷贝数中性状态和与重复区的关联,检测起来特别具有挑战性。倒位约占所有平衡染色体结构畸变的 1/20,可通过基因中断或改变顺式剂量敏感基因的调控区而导致疾病。短读基因组测序(srGS)只能解决临床诊断实验室转来的70%细胞遗传学上可见的倒位,这可能是由于重复区域的断点造成的。在这里,我们通过长线程基因组测序(lrGS)(n = 9)或 srGS(n = 3)研究了 12 例倒位,并解决了其中的 9 例。在四个案例中,反转断点区域在至少一个人类参考基因组(GRCh37、GRCh38、T2T-CHM13)中缺失,因此需要进行参考不可知分析。在这些病例中,有一个 INV9 仅在使用 T2T-CHM13 的从头组装 lrGS 数据中可映射,它破坏了 EHMT1,与孟德尔诊断一致(Kleefstra 综合征 1;MIM#610253)。接下来,通过对 T2T-CHM13、GRCh37 和 GRCh38 以及黑猩猩和倭黑猩猩进行配对比较,我们发现至少有一个人类参考文献缺失了数百兆字节的序列,这突出表明灵长类动物基因组对基因组多样性做出了贡献。将群体基因组数据与这些区域进行比对表明,这些区域在不同个体之间存在差异。我们的分析强调,T2T-CHM13 对临床诊断中的最佳反转检测来说,是最大化 lrGS 价值的必要条件。这些结果突显了利用多样化和全面的参考基因组解决罕见病未解决的分子病例的重要性。
{"title":"Leveraging the T2T assembly to resolve rare and pathogenic inversions in reference genome gaps","authors":"Kristine Bilgrav Saether, Jesper Eisfeldt, Jesse D. Bengtsson, Ming Yin Lun, Christopher M. Grochowski, Medhat Mahmoud, Hsiao-Tuan Chao, Jill A. Rosenfeld, Pengfei Liu, Marlene Ek, Jakob Schuy, Adam Ameur, Hongzheng Dai, Undiagnosed Diseases Network, James Paul Hwang, Fritz J. Sedlazeck, Weimin Bi, Ronit Marom, Josephine Wincent, Ann Nordgren, Claudia M.B. Carvalho, Anna Lindstrand","doi":"10.1101/gr.279346.124","DOIUrl":"https://doi.org/10.1101/gr.279346.124","url":null,"abstract":"Chromosomal inversions (INVs) are particularly challenging to detect due to their copy-number neutral state and association with repetitive regions. Inversions represent about 1/20 of all balanced structural chromosome aberrations and can lead to disease by gene disruption or altering regulatory regions of dosage-sensitive genes in <em>cis</em>. Short-read genome sequencing (srGS) can only resolve ∼70% of cytogenetically visible inversions referred to clinical diagnostic laboratories, likely due to breakpoints in repetitive regions. Here, we study 12 inversions by long-read genome sequencing (lrGS) (<em>n</em> = 9) or srGS (<em>n</em> = 3) and resolve nine of them. In four cases, the inversion breakpoint region was missing from at least one of the human reference genomes (GRCh37, GRCh38, T2T-CHM13) and a reference agnostic analysis was needed. One of these cases, an INV9 mappable only in de novo assembled lrGS data using T2T-CHM13 disrupts <em>EHMT1</em> consistent with a Mendelian diagnosis (Kleefstra syndrome 1; MIM#610253). Next, by pairwise comparison between T2T-CHM13, GRCh37, and GRCh38, as well as the chimpanzee and bonobo, we show that hundreds of megabases of sequence are missing from at least one human reference, highlighting that primate genomes contribute to genomic diversity. Aligning population genomic data to these regions indicated that these regions are variable between individuals. Our analysis emphasizes that T2T-CHM13 is necessary to maximize the value of lrGS for optimal inversion detection in clinical diagnostics. These results highlight the importance of leveraging diverse and comprehensive reference genomes to resolve unsolved molecular cases in rare diseases.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":7.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142563090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
High-coverage nanopore sequencing of samples from the 1000 Genomes Project to build a comprehensive catalog of human genetic variation. 对来自 "1000 基因组计划 "的样本进行高覆盖率纳米孔测序,建立人类遗传变异综合目录。
IF 6.2 2区 生物学 Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date : 2024-10-30 DOI: 10.1101/gr.279273.124
Jonas A Gustafson, Sophia B Gibson, Nikhita Damaraju, Miranda P G Zalusky, Kendra Hoekzema, David Twesigomwe, Lei Yang, Anthony A Snead, Phillip A Richmond, Wouter De Coster, Nathan D Olson, Andrea Guarracino, Qiuhui Li, Angela L Miller, Joy Goffena, Zachary B Anderson, Sophie H R Storz, Sydney A Ward, Maisha Sinha, Claudia Gonzaga-Jauregui, Wayne E Clarke, Anna O Basile, André Corvelo, Catherine Reeves, Adrienne Helland, Rajeeva Lochan Musunuri, Mahler Revsine, Karynne E Patterson, Cate R Paschal, Christina Zakarian, Sara Goodwin, Tanner D Jensen, Esther Robb, W Richard McCombie, Fritz J Sedlazeck, Justin M Zook, Stephen B Montgomery, Erik Garrison, Mikhail Kolmogorov, Michael C Schatz, Richard N McLaughlin, Harriet Dashnow, Michael C Zody, Matt Loose, Miten Jain, Evan E Eichler, Danny E Miller

Fewer than half of individuals with a suspected Mendelian or monogenic condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control data sets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project (1KGP) Oxford Nanopore Technologies Sequencing Consortium aims to generate LRS data from at least 800 of the 1KGP samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37× and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.

只有不到一半的孟德尔或单基因疑似病例在经过全面的临床基因检测后获得了精确的分子诊断。数据质量和成本的提高提高了人们对使用长读程测序(LRS)简化临床基因组检测的兴趣,但由于缺乏用于变异筛选和优先排序的对照数据集,LRS 数据的三级分析具有挑战性。为了解决这个问题,1000 基因组计划 ONT 测序联盟的目标是从 1000 基因组计划中至少 800 个样本中生成 LRS 数据。我们的目标是利用 LRS 来识别更广泛的变异,从而提高我们对人类正常变异模式的理解。在这里,我们展示了对代表所有 5 个超级种群和 19 个亚种群的前 100 个样本的分析数据。这些样本的平均测序覆盖深度为 37 倍,测序读数 N50 为 54 kbp,在识别同源多聚物区域之外的单核苷酸和滞后变异方面与之前的研究具有很高的一致性。通过使用多个结构变异(SV)调用器,我们在每个基因组中平均鉴定出 24,543 个高置信度 SV,其中包括可能破坏基因功能的共享和私有 SV,以及使用短读数无法检测到的疾病相关重复序列中的致病性扩增。对甲基化特征的评估揭示了已知印迹位点的预期模式、具有偏斜 X 失活模式的样本以及新的差异甲基化区域。所有原始测序数据、处理过的数据和统计摘要都是公开的,为临床遗传学界发现致病性 SV 提供了宝贵的资源。
{"title":"High-coverage nanopore sequencing of samples from the 1000 Genomes Project to build a comprehensive catalog of human genetic variation.","authors":"Jonas A Gustafson, Sophia B Gibson, Nikhita Damaraju, Miranda P G Zalusky, Kendra Hoekzema, David Twesigomwe, Lei Yang, Anthony A Snead, Phillip A Richmond, Wouter De Coster, Nathan D Olson, Andrea Guarracino, Qiuhui Li, Angela L Miller, Joy Goffena, Zachary B Anderson, Sophie H R Storz, Sydney A Ward, Maisha Sinha, Claudia Gonzaga-Jauregui, Wayne E Clarke, Anna O Basile, André Corvelo, Catherine Reeves, Adrienne Helland, Rajeeva Lochan Musunuri, Mahler Revsine, Karynne E Patterson, Cate R Paschal, Christina Zakarian, Sara Goodwin, Tanner D Jensen, Esther Robb, W Richard McCombie, Fritz J Sedlazeck, Justin M Zook, Stephen B Montgomery, Erik Garrison, Mikhail Kolmogorov, Michael C Schatz, Richard N McLaughlin, Harriet Dashnow, Michael C Zody, Matt Loose, Miten Jain, Evan E Eichler, Danny E Miller","doi":"10.1101/gr.279273.124","DOIUrl":"10.1101/gr.279273.124","url":null,"abstract":"<p><p>Fewer than half of individuals with a suspected Mendelian or monogenic condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control data sets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project (1KGP) Oxford Nanopore Technologies Sequencing Consortium aims to generate LRS data from at least 800 of the 1KGP samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37× and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":6.2,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142365031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Visualization and analysis of medically relevant tandem repeats in nanopore sequencing of control cohorts with pathSTR. 纳米孔测序中与医学相关的串联重复序列的可视化和分析,以及病理序列对照组。
IF 6.2 2区 生物学 Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date : 2024-10-30 DOI: 10.1101/gr.279265.124
Wouter De Coster, Ida Höijer, Inge Bruggeman, Svenn D'Hert, Malin Melin, Adam Ameur, Rosa Rademakers

The lack of population-scale databases hampers research and diagnostics for medically relevant tandem repeats and repeat expansions. We attempt to fill this gap using our pathSTR web tool, which leverages long-read sequencing of large cohorts to determine repeat length and sequence composition in a healthy population. The current version includes 1040 individuals of The 1000 Genomes Project cohort sequenced on the Oxford Nanopore Technologies PromethION. A comprehensive set of medically relevant tandem repeats has been genotyped using STRdust and LongTR to determine the tandem repeat length and sequence composition. PathSTR provides rich visualizations of this data set and the feature to upload one's data for comparison along the control cohort. We demonstrate the implementation of this application using data from targeted nanopore sequencing of a patient with myotonic dystrophy type 1. This resource will empower the genetics community to get a more complete overview of normal variation in tandem repeat length and sequence composition and, as such, enable a better assessment of rare tandem repeat alleles observed in patients.

缺乏人群规模的数据库阻碍了与医学相关的串联重复序列和重复扩增的研究和诊断。我们试图利用我们的 pathSTR 网络工具填补这一空白,该工具利用大型队列的长读程测序来确定健康人群的重复序列长度和序列组成。当前版本包括在牛津纳米孔技术公司的 PromethION 上测序的 1000 基因组计划队列中的 1040 个个体。利用 STRdust 和 LongTR 对一组全面的医学相关串联重复序列进行了基因分型,以确定串联重复序列的长度和序列组成。PathSTR 为该数据集提供了丰富的可视化功能,并提供了上传个人数据以便与对照组数据进行比较的功能。我们利用一名 1 型肌张力营养不良症患者的定向纳米孔测序数据演示了这一应用的实施。这一资源将使遗传学界能够更全面地了解串联重复长度和序列组成的正常变异,从而更好地评估在患者身上观察到的罕见串联重复等位基因。
{"title":"Visualization and analysis of medically relevant tandem repeats in nanopore sequencing of control cohorts with pathSTR.","authors":"Wouter De Coster, Ida Höijer, Inge Bruggeman, Svenn D'Hert, Malin Melin, Adam Ameur, Rosa Rademakers","doi":"10.1101/gr.279265.124","DOIUrl":"10.1101/gr.279265.124","url":null,"abstract":"<p><p>The lack of population-scale databases hampers research and diagnostics for medically relevant tandem repeats and repeat expansions. We attempt to fill this gap using our pathSTR web tool, which leverages long-read sequencing of large cohorts to determine repeat length and sequence composition in a healthy population. The current version includes 1040 individuals of The 1000 Genomes Project cohort sequenced on the Oxford Nanopore Technologies PromethION. A comprehensive set of medically relevant tandem repeats has been genotyped using STRdust and LongTR to determine the tandem repeat length and sequence composition. PathSTR provides rich visualizations of this data set and the feature to upload one's data for comparison along the control cohort. We demonstrate the implementation of this application using data from targeted nanopore sequencing of a patient with myotonic dystrophy type 1. This resource will empower the genetics community to get a more complete overview of normal variation in tandem repeat length and sequence composition and, as such, enable a better assessment of rare tandem repeat alleles observed in patients.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":6.2,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141987779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Geometric deep learning framework for de novo genome assembly 用于从头开始基因组组装的几何深度学习框架
IF 7 2区 生物学 Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date : 2024-10-29 DOI: 10.1101/gr.279307.124
Lovro Vrček, Xavier Bresson, Thomas Laurent, Martin Schmitz, Kenji Kawaguchi, Mile Šikić
The critical stage of every de novo genome assembler is identifying paths in assembly graphs that correspond to the reconstructed genomic sequences. The existing algorithmic methods struggle with this, primarily due to repetitive regions causing complex graph tangles, leading to fragmented assemblies. Here, we introduce GNNome, a framework for path identification based on geometric deep learning that enables training models on assembly graphs without relying on existing assembly strategies. By leveraging only the symmetries inherent to the problem, GNNome reconstructs assemblies from PacBio HiFi reads with contiguity and quality comparable to those of the state-of-the-art tools across several species. With every new genome assembled telomere-to-telomere, the amount of reliable training data at our disposal increases. Combining the straightforward generation of abundant simulated data for diverse genomic structures with the AI approach makes the proposed framework a plausible cornerstone for future work on reconstructing complex genomes with different ploidy and aneuploidy degrees. To facilitate such developments, we make the framework and the best-performing model publicly available, provided as a tool that can directly be used to assemble new haploid genomes.
每一个全新基因组装配器的关键阶段都是在装配图中找出与重建的基因组序列相对应的路径。现有的算法方法很难做到这一点,主要原因是重复区域会造成复杂的图形纠结,从而导致组装结果支离破碎。在这里,我们介绍基于几何深度学习的路径识别框架 GNNome,它可以在不依赖现有组装策略的情况下在组装图上训练模型。GNNome 仅利用问题固有的对称性,就能从 PacBio HiFi 读数中重建组装,其连续性和质量可与多个物种的最先进工具相媲美。每组装一个新的端粒到端粒的基因组,我们就能获得更多可靠的训练数据。针对不同基因组结构直接生成大量模拟数据的方法与人工智能方法相结合,使我们提出的框架成为未来重建不同倍性和非整倍体程度的复杂基因组工作的可靠基石。为了促进这种发展,我们公开了该框架和表现最佳的模型,并将其作为一种可直接用于组装新单倍体基因组的工具。
{"title":"Geometric deep learning framework for de novo genome assembly","authors":"Lovro Vrček, Xavier Bresson, Thomas Laurent, Martin Schmitz, Kenji Kawaguchi, Mile Šikić","doi":"10.1101/gr.279307.124","DOIUrl":"https://doi.org/10.1101/gr.279307.124","url":null,"abstract":"The critical stage of every de novo genome assembler is identifying paths in assembly graphs that correspond to the reconstructed genomic sequences. The existing algorithmic methods struggle with this, primarily due to repetitive regions causing complex graph tangles, leading to fragmented assemblies. Here, we introduce GNNome, a framework for path identification based on geometric deep learning that enables training models on assembly graphs without relying on existing assembly strategies. By leveraging only the symmetries inherent to the problem, GNNome reconstructs assemblies from PacBio HiFi reads with contiguity and quality comparable to those of the state-of-the-art tools across several species. With every new genome assembled telomere-to-telomere, the amount of reliable training data at our disposal increases. Combining the straightforward generation of abundant simulated data for diverse genomic structures with the AI approach makes the proposed framework a plausible cornerstone for future work on reconstructing complex genomes with different ploidy and aneuploidy degrees. To facilitate such developments, we make the framework and the best-performing model publicly available, provided as a tool that can directly be used to assemble new haploid genomes.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":7.0,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142541287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Resolving complex duplication variants in autism spectrum disorder using long-read genome sequencing 利用长线程基因组测序解决自闭症谱系障碍中的复杂重复变异问题
IF 7 2区 生物学 Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date : 2024-10-29 DOI: 10.1101/gr.279263.124
Jesper Eisfeldt, Edward J. Higginbotham, Felix Lenner, Jennifer Howe, Bridget A. Fernandez, Anna Lindstrand, Stephen W. Scherer, Lars Feuk
Rare or de novo structural variation, primarily in the form of copy number variants, is detected in 5%–10% of autism spectrum disorder (ASD) families. While complex structural variants involving duplications can generally be detected using microarray or short-read genome sequencing (GS), these methods frequently fail to characterize breakpoints at nucleotide resolution, requiring additional molecular methods for validation and fine-mapping. Here, we use Oxford Nanopore Technologies PromethION long-read GS to characterize complex genomic rearrangements (CGRs) involving large duplications that segregate with ASD in five families. In total, we investigated 13 CGR carriers and were able to resolve all breakpoint junctions at nucleotide resolution. While all breakpoints were identified, the precise genomic architecture of one rearrangement remained unresolved with three different potential structures. The findings in two families include potential fusion genes formed through duplication rearrangements, involving IL1RAPL1–DMD and SUPT16H–CHD8. In two of the families originating from the same geographical region, an identical rearrangement involving ANK2 was identified, which likely represents a founder variant. In addition, we analyze methylation status directly from the long-read data, allowing us to assess the activity of rearranged genes and regulatory regions. Investigation of methylation across the CGRs reveals aberrant methylation status in carriers across a rearrangement affecting the CREBBP locus. In aggregate, our results demonstrate the utility of nanopore sequencing to pinpoint CGRs associated with ASD in five unrelated families, and highlight the importance of a gene-centric description of disease-associated complex chromosomal rearrangements.
在5%-10%的自闭症谱系障碍(ASD)家族中,可以检测到罕见的或从头开始的结构变异,主要以拷贝数变异的形式存在。虽然使用微阵列或短线程基因组测序(GS)通常可以检测到涉及重复的复杂结构变异,但这些方法经常无法以核苷酸分辨率描述断点的特征,因此需要额外的分子方法进行验证和精细图谱绘制。在这里,我们使用牛津纳米孔技术公司(Oxford Nanopore Technologies)的PromethION长读程基因组测序技术,对五个家族中与ASD分离的涉及大重复的复杂基因组重排(CGRs)进行了表征。我们总共研究了 13 个 CGR 携带者,并以核苷酸分辨率解析了所有断点连接。虽然确定了所有断点,但一个重排的精确基因组结构仍未确定,有三种不同的潜在结构。两个家系的发现包括通过重复重排形成的潜在融合基因,涉及 IL1RAPL1-DMD 和 SUPT16H-CHD8。在来自同一地理区域的两个家系中,发现了涉及 ANK2 的相同重排,这很可能是一个创始变异基因。此外,我们还直接从长读数数据中分析甲基化状态,从而评估重排基因和调控区域的活性。对整个 CGRs 的甲基化调查显示,在影响 CREBBP 基因座的重排中,携带者的甲基化状态异常。总之,我们的研究结果证明了纳米孔测序技术在确定五个无关联家族中与 ASD 相关的 CGRs 方面的实用性,并强调了以基因为中心描述与疾病相关的复杂染色体重排的重要性。
{"title":"Resolving complex duplication variants in autism spectrum disorder using long-read genome sequencing","authors":"Jesper Eisfeldt, Edward J. Higginbotham, Felix Lenner, Jennifer Howe, Bridget A. Fernandez, Anna Lindstrand, Stephen W. Scherer, Lars Feuk","doi":"10.1101/gr.279263.124","DOIUrl":"https://doi.org/10.1101/gr.279263.124","url":null,"abstract":"Rare or de novo structural variation, primarily in the form of copy number variants, is detected in 5%–10% of autism spectrum disorder (ASD) families. While complex structural variants involving duplications can generally be detected using microarray or short-read genome sequencing (GS), these methods frequently fail to characterize breakpoints at nucleotide resolution, requiring additional molecular methods for validation and fine-mapping. Here, we use Oxford Nanopore Technologies PromethION long-read GS to characterize complex genomic rearrangements (CGRs) involving large duplications that segregate with ASD in five families. In total, we investigated 13 CGR carriers and were able to resolve all breakpoint junctions at nucleotide resolution. While all breakpoints were identified, the precise genomic architecture of one rearrangement remained unresolved with three different potential structures. The findings in two families include potential fusion genes formed through duplication rearrangements, involving <em>IL1RAPL1–DMD</em> and <em>SUPT16H–CHD8</em>. In two of the families originating from the same geographical region, an identical rearrangement involving <em>ANK2</em> was identified, which likely represents a founder variant. In addition, we analyze methylation status directly from the long-read data, allowing us to assess the activity of rearranged genes and regulatory regions. Investigation of methylation across the CGRs reveals aberrant methylation status in carriers across a rearrangement affecting the <em>CREBBP</em> locus. In aggregate, our results demonstrate the utility of nanopore sequencing to pinpoint CGRs associated with ASD in five unrelated families, and highlight the importance of a gene-centric description of disease-associated complex chromosomal rearrangements.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":7.0,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142541288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
期刊
Genome research
全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1