Genome Research: latest articles

Leveraging the power of long reads for targeted sequencing.
IF 6.2, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY. Pub Date: 2024-11-20. DOI: 10.1101/gr.279168.124
Shruti V Iyer, Sara Goodwin, William Richard McCombie

Long-read sequencing technologies have improved the contiguity and, as a result, the quality of genome assemblies by generating reads long enough to span and resolve complex or repetitive regions of the genome. Several groups have shown the power of long reads in detecting thousands of genomic and epigenomic features that were previously missed by short-read sequencing approaches. While these studies demonstrate how long reads can help resolve repetitive and complex regions of the genome, they also highlight the throughput and coverage requirements needed to accurately resolve variant alleles across large populations using these platforms. At the time of this review, whole-genome long-read sequencing is more expensive than short-read sequencing on the highest throughput short-read instruments; thus, achieving sufficient coverage to detect low-frequency variants (such as somatic variation) in heterogeneous samples remains challenging. Targeted sequencing, on the other hand, provides the depth necessary to detect these low-frequency variants in heterogeneous populations. Here, we review currently used and recently developed targeted sequencing strategies that leverage existing long-read technologies to increase the resolution with which we can look at nucleic acids in a variety of biological contexts.
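
The link between sequencing depth and low-frequency variant detection in the abstract above can be illustrated with a simple calculation. The sketch below is not taken from the review: it assumes an idealized binomial sampling model with no sequencing error, and the function name, the 1% allele fraction, and the 3-read threshold are all hypothetical choices made for illustration.

```python
from math import comb


def prob_at_least_k(depth: int, vaf: float, k: int) -> float:
    """Probability that at least k of `depth` reads carry a variant.

    Assumes each read independently samples the variant allele with
    probability `vaf` (binomial model, sequencing error ignored).
    """
    # Sum the probabilities of seeing 0 .. k-1 variant reads, then take the complement.
    p_fewer = sum(
        comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i) for i in range(k)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # Hypothetical scenario: a 1% allele-fraction variant, requiring 3 supporting reads.
    for depth in (30, 300, 3000):
        print(depth, round(prob_at_least_k(depth, vaf=0.01, k=3), 3))
```

Under this model, the chance of seeing enough supporting reads rises steeply with depth, which is the intuition behind using targeted sequencing to concentrate reads on regions of interest. Real pipelines would also need to account for sequencing error and mapping artifacts.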

Genome Research 34(11): 1701-1718. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610587/pdf/
Citations: 0
High-coverage nanopore sequencing of samples from the 1000 Genomes Project to build a comprehensive catalog of human genetic variation.
IF 6.2, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY. Pub Date: 2024-11-20. DOI: 10.1101/gr.279273.124
Jonas A Gustafson, Sophia B Gibson, Nikhita Damaraju, Miranda P G Zalusky, Kendra Hoekzema, David Twesigomwe, Lei Yang, Anthony A Snead, Phillip A Richmond, Wouter De Coster, Nathan D Olson, Andrea Guarracino, Qiuhui Li, Angela L Miller, Joy Goffena, Zachary B Anderson, Sophie H R Storz, Sydney A Ward, Maisha Sinha, Claudia Gonzaga-Jauregui, Wayne E Clarke, Anna O Basile, André Corvelo, Catherine Reeves, Adrienne Helland, Rajeeva Lochan Musunuri, Mahler Revsine, Karynne E Patterson, Cate R Paschal, Christina Zakarian, Sara Goodwin, Tanner D Jensen, Esther Robb, William Richard McCombie, Fritz J Sedlazeck, Justin M Zook, Stephen B Montgomery, Erik Garrison, Mikhail Kolmogorov, Michael C Schatz, Richard N McLaughlin, Harriet Dashnow, Michael C Zody, Matt Loose, Miten Jain, Evan E Eichler, Danny E Miller

Fewer than half of individuals with a suspected Mendelian or monogenic condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control data sets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project (1KGP) Oxford Nanopore Technologies Sequencing Consortium aims to generate LRS data from at least 800 of the 1KGP samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37× and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
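
The abstract above reports a sequence read N50 of 54 kbp. For readers unfamiliar with the statistic, the sketch below shows one common way to compute a read N50: the length L such that reads of length at least L contain at least half of all sequenced bases. This is a minimal illustration, not the consortium's pipeline; the function name and the toy read lengths are hypothetical.

```python
from typing import Iterable


def read_n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of read lengths.

    Reads are visited from longest to shortest; the N50 is the length of
    the read at which the running total first reaches half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("N50 is undefined for an empty set of reads")


if __name__ == "__main__":
    # Toy read lengths in bp (hypothetical values, not data from the study).
    print(read_n50([2_000, 3_000, 4_000, 5_000, 6_000]))
```

Because the statistic is weighted by bases rather than by read count, a handful of very long reads can raise the N50 well above the median read length.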

Genome Research, pp. 2061-2073. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610458/pdf/
Citations: 0
Genomic epidemiology of carbapenem-resistant Enterobacterales at a New York City hospital over a 10-year period reveals complex plasmid-clone dynamics and evidence for frequent horizontal transfer of bla KPC.
IF 6.2, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY. Pub Date: 2024-11-20. DOI: 10.1101/gr.279355.124
Angela Gomez-Simmonds, Medini K Annavajhala, Dwayne Seeram, Todd W Hokunson, Heekuk Park, Anne-Catrin Uhlemann

Transmission of carbapenem-resistant Enterobacterales (CRE) in hospitals has been shown to occur through complex, multifarious networks driven by both clonal spread and horizontal transfer mediated by plasmids and other mobile genetic elements. We performed nanopore long-read sequencing on CRE isolates from a large urban hospital system to determine the overall contribution of plasmids to CRE transmission and identify specific plasmids implicated in the spread of bla KPC (the Klebsiella pneumoniae carbapenemase [KPC] gene). Six hundred and five CRE isolates collected between 2009 and 2018 first underwent Illumina sequencing for genome-wide genotyping; 435 bla KPC-positive isolates were then successfully nanopore sequenced to generate hybrid assemblies including circularized bla KPC-harboring plasmids. Phylogenetic analysis and Mash clustering were used to define putative clonal and plasmid transmission clusters, respectively. Overall, CRE isolates belonged to 96 multilocus sequence types (STs) encoding bla KPC on 447 plasmids which formed 54 plasmid clusters. We found evidence for clonal transmission in 66% of CRE isolates, over half of which belonged to four clades comprising K. pneumoniae ST258. Plasmid-mediated acquisition of bla KPC occurred in 23%-27% of isolates. While most plasmid clusters were small, several plasmids were identified in multiple different species and STs, including a highly promiscuous IncN plasmid and an IncF plasmid putatively spreading bla KPC from ST258 to other clones. Overall, this points to both the continued dominance of K. pneumoniae ST258 and the dissemination of bla KPC across clones and species by diverse plasmid backbones. These findings support integrating long-read sequencing into genomic surveillance approaches to detect the hitherto silent spread of carbapenem resistance driven by mobile plasmids.

Genome Research, pp. 1895-1907. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610580/pdf/
Citations: 0
Gapless assembly of complete human and plant chromosomes using only nanopore sequencing.
IF 6.2, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY. Pub Date: 2024-11-20. DOI: 10.1101/gr.279334.124
Sergey Koren, Zhigui Bao, Andrea Guarracino, Shujun Ou, Sara Goodwin, Katharine M Jenike, Julian Lucas, Brandy McNulty, Jimin Park, Mikko Rautiainen, Arang Rhie, Dick Roelofs, Harrie Schneiders, Ilse Vrijenhoek, Koen Nijbroek, Olle Nordesjo, Sergey Nurk, Mike Vella, Katherine R Lawrence, Doreen Ware, Michael C Schatz, Erik Garrison, Sanwen Huang, William Richard McCombie, Karen H Miga, Alexander H J Wittenberg, Adam M Phillippy

The combination of ultra-long (UL) Oxford Nanopore Technologies (ONT) sequencing reads with long, accurate Pacific Biosciences (PacBio) High Fidelity (HiFi) reads has enabled the completion of a human genome and spurred similar efforts to complete the genomes of many other species. However, this approach for complete, "telomere-to-telomere" genome assembly relies on multiple sequencing platforms, limiting its accessibility. ONT "Duplex" sequencing reads, where both strands of the DNA are read to improve quality, promise high per-base accuracy. To evaluate this new data type, we generated ONT Duplex data for three widely studied genomes: human HG002, Solanum lycopersicum Heinz 1706 (tomato), and Zea mays B73 (maize). For the diploid, heterozygous HG002 genome, we also used "Pore-C" chromatin contact mapping to completely phase the haplotypes. We found the accuracy of Duplex data to be similar to HiFi sequencing, but with read lengths tens of kilobases longer, and the Pore-C data to be compatible with existing diploid assembly algorithms. This combination of read length and accuracy enables the construction of a high-quality initial assembly, which can then be further resolved using the UL reads, and finally phased into chromosome-scale haplotypes with Pore-C. The resulting assemblies have a base accuracy exceeding 99.999% (Q50) and near-perfect continuity, with most chromosomes assembled as single contigs. We conclude that ONT sequencing is a viable alternative to HiFi sequencing for de novo genome assembly, and provides a multirun single-instrument solution for the reconstruction of complete genomes.
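
The abstract above equates a base accuracy of 99.999% with Q50. That correspondence follows from the standard Phred scale, Q = -10 · log10(error rate). The short sketch below shows the conversion in both directions; the function names are hypothetical and the code is an illustration, not part of the authors' assembly workflow.

```python
import math


def phred_from_accuracy(accuracy: float) -> float:
    """Convert a per-base accuracy (e.g. 0.99999) to a Phred quality value."""
    error_rate = 1.0 - accuracy
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("accuracy must be in the range [0, 1)")
    return -10.0 * math.log10(error_rate)


def error_from_phred(q: float) -> float:
    """Convert a Phred quality value back to a per-base error rate."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # 99.999% accuracy is one expected error per 100,000 bases.
    print(round(phred_from_accuracy(0.99999), 2))
    print(error_from_phred(50))
```

Each additional 10 Phred units is a tenfold drop in the expected error rate, so Q50 is ten times more accurate than Q40.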

Genome Research, pp. 1919-1930. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610574/pdf/
Citations: 0
Long-read subcellular fractionation and sequencing reveals the translational fate of full-length mRNA isoforms during neuronal differentiation.
IF 6.2, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY. Pub Date: 2024-11-20. DOI: 10.1101/gr.279170.124
Alexander J Ritter, Jolene M Draper, Christopher Vollmers, Jeremy R Sanford

Alternative splicing (AS) alters the cis-regulatory landscape of mRNA isoforms, leading to transcripts with distinct localization, stability, and translational efficiency. To rigorously investigate mRNA isoform-specific ribosome association, we generated subcellular fractionation and sequencing (Frac-seq) libraries using both conventional short reads and long reads from human embryonic stem cells (ESCs) and neural progenitor cells (NPCs) derived from the same ESCs. We performed de novo transcriptome assembly from high-confidence long reads from cytosolic, monosomal, light, and heavy polyribosomal fractions and quantified their abundance using short reads from their respective subcellular fractions. Thousands of transcripts in each cell type exhibited association with particular subcellular fractions relative to the cytosol. Of the multi-isoform genes, 27% and 19% exhibited significant differential isoform sedimentation in ESCs and NPCs, respectively. Alternative promoter usage and internal exon skipping accounted for the majority of differences between isoforms from the same gene. Random forest classifiers implicated coding sequence (CDS) and untranslated region (UTR) lengths as important determinants of isoform-specific sedimentation profiles, and motif analyses reveal potential cell type-specific and subcellular fraction-associated RNA-binding protein signatures. Taken together, our data demonstrate that alternative mRNA processing within the CDS and UTRs impacts the translational control of mRNA isoforms during stem cell differentiation, and highlight the utility of using a novel long-read sequencing-based method to study translational control.

Genome Research, pp. 2000-2011. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610577/pdf/
Citations: 0
Haplotype-resolved genome and population genomics of the threatened garden dormouse in Europe.
IF 6.2, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY. Pub Date: 2024-11-20. DOI: 10.1101/gr.279066.124
Paige A Byerly, Alina von Thaden, Evgeny Leushkin, Leon Hilgers, Shenglin Liu, Sven Winter, Tilman Schell, Charlotte Gerheim, Alexander Ben Hamadou, Carola Greve, Christian Betz, Hanno J Bolz, Sven Büchner, Johannes Lang, Holger Meinig, Evax Marie Famira-Parcsetich, Sarah P Stubbe, Alice Mouton, Sandro Bertolino, Goedele Verbeylen, Thomas Briner, Lídia Freixas, Lorenzo Vinciguerra, Sarah A Mueller, Carsten Nowak, Michael Hiller

Genomic resources are important for evaluating genetic diversity and supporting conservation efforts. The garden dormouse (Eliomys quercinus) is a small rodent that has experienced one of the most severe modern population declines in Europe. We present a high-quality haplotype-resolved reference genome for the garden dormouse, and combine comprehensive short- and long-read transcriptomics data sets with homology-based methods to generate a highly complete gene annotation. Demographic history analysis of the genome reveals a sharp population decline since the last interglacial, indicating an association between colder climates and population declines before anthropogenic influence. Using our genome and genetic data from 100 individuals, largely sampled in a citizen-science project across the contemporary range, we conduct the first population genomic analysis for this species. We find clear evidence for population structure across the species' core Central European range. Notably, our data show that the Alpine population, characterized by strong differentiation and reduced genetic diversity, is reproductively isolated from other regions and likely represents a differentiated evolutionarily significant unit (ESU). The predominantly declining Eastern European populations also show signs of recent isolation, a pattern consistent with a range expansion from Western to Eastern Europe during the Holocene, leaving relict populations now facing local extinction. Overall, our findings suggest that garden dormouse conservation may be enhanced in Europe through the designation of ESUs.

Genome Research, pp. 2094-2107. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610594/pdf/
Citations: 0
Challenges in identifying mRNA transcript starts and ends from long-read sequencing data.
IF 6.2, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY, Pub Date: 2024-11-20, DOI: 10.1101/gr.279559.124
Ezequiel Calvo-Roitberg, Rachel F Daniels, Athma A Pai

Long-read sequencing (LRS) technologies have the potential to revolutionize scientific discoveries in RNA biology through the comprehensive identification and quantification of full-length mRNA isoforms. Despite great promise, challenges remain in the widespread implementation of LRS technologies for RNA-based applications, including concerns about low coverage, high sequencing error, and robust computational pipelines. Although much focus has been placed on defining mRNA exon composition and structure with LRS data, less careful characterization has been done of the ability to assess the terminal ends of isoforms, specifically, transcription start and end sites. Such characterization is crucial for completely delineating full mRNA molecules and regulatory consequences. However, there are substantial inconsistencies in both start and end coordinates of LRS reads spanning a gene, such that LRS reads often fail to accurately recapitulate annotated or empirically derived terminal ends of mRNA molecules. Here, we describe the specific challenges of identifying and quantifying mRNA terminal ends with LRS technologies and how these issues influence biological interpretations of LRS data. We then review recent experimental and computational advances designed to alleviate these problems, with ideal use cases for each approach. Finally, we outline anticipated developments and necessary improvements for the characterization of terminal ends from LRS data.

Citations: 0
Understanding isoform expression by pairing long-read sequencing with single-cell and spatial transcriptomics.
IF 6.2, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY, Pub Date: 2024-11-20, DOI: 10.1101/gr.279640.124
Natan Belchikov, Justine Hsu, Xiang Jennie Li, Julien Jarroux, Wen Hu, Anoushka Joglekar, Hagen U Tilgner

RNA isoform diversity, produced via alternative splicing, and alternative usage of transcription start and poly(A) sites, results in varied transcripts being derived from the same gene. Distinct isoforms can play important biological roles, including by changing the sequences or expression levels of protein products. The first single-cell approaches to RNA sequencing (and, later, spatial approaches), which are now widely used for the identification of differentially expressed genes, rely on short reads and offer the ability to transcriptomically compare different cell types but are limited in their ability to measure differential isoform expression. More recently, long-read sequencing methods have been combined with single-cell and spatial technologies in order to characterize isoform expression. In this review, we provide an overview of the emergence of single-cell and spatial long-read sequencing and discuss the challenges associated with the implementation of these technologies and interpretation of these data. We discuss the opportunities they offer for understanding the relationships between the distinct variable elements of transcript molecules and highlight some of the ways in which they have been used to characterize isoforms' roles in development and pathology. Single-nucleus long-read sequencing, a special case of the single-cell approach, is also discussed. We attempt to cover both the limitations of these technologies and their significant potential for expanding our still-limited understanding of the biological roles of RNA isoforms.

Citations: 0
DNA-m6A calling and integrated long-read epigenetic and genetic analysis with fibertools.
IF 6.2, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY, Pub Date: 2024-11-20, DOI: 10.1101/gr.279095.124
Anupama Jha, Stephanie C Bohaczuk, Yizi Mao, Jane Ranchalis, Benjamin J Mallory, Alan T Min, Morgan O Hamm, Elliott Swanson, Danilo Dubocanin, Connor Finkbeiner, Tony Li, Dale Whittington, William Stafford Noble, Andrew B Stergachis, Mitchell R Vollger

Long-read DNA sequencing has recently emerged as a powerful tool for studying both genetic and epigenetic architectures at single-molecule and single-nucleotide resolution. Long-read epigenetic studies encompass both the direct identification of native cytosine methylation and the identification of exogenously placed DNA N6-methyladenine (DNA-m6A). However, detecting DNA-m6A modifications using single-molecule sequencing, as well as coprocessing single-molecule genetic and epigenetic architectures, is limited by computational demands and a lack of supporting tools. Here, we introduce fibertools, a state-of-the-art toolkit that features a semisupervised convolutional neural network for fast and accurate identification of m6A-marked bases using Pacific Biosciences (PacBio) single-molecule long-read sequencing, as well as the coprocessing of long-read genetic and epigenetic data produced using either the PacBio or Oxford Nanopore Technologies (ONT) sequencing platforms. We demonstrate accurate DNA-m6A identification (>90% precision and recall) along >20 kb long DNA molecules with an ∼1000-fold improvement in speed. In addition, we demonstrate that fibertools can readily integrate genetic and epigenetic data at single-molecule resolution, including the seamless conversion between molecular and reference coordinate systems, allowing for accurate genetic and epigenetic analyses of long-read data within structurally and somatically variable genomic regions.

Citations: 0
Visualization and analysis of medically relevant tandem repeats in nanopore sequencing of control cohorts with pathSTR.
IF 6.2, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY, Pub Date: 2024-11-20, DOI: 10.1101/gr.279265.124
Wouter De Coster, Ida Höijer, Inge Bruggeman, Svenn D'Hert, Malin Melin, Adam Ameur, Rosa Rademakers

The lack of population-scale databases hampers research and diagnostics for medically relevant tandem repeats and repeat expansions. We attempt to fill this gap using our pathSTR web tool, which leverages long-read sequencing of large cohorts to determine repeat length and sequence composition in a healthy population. The current version includes 1040 individuals of The 1000 Genomes Project cohort sequenced on the Oxford Nanopore Technologies PromethION. A comprehensive set of medically relevant tandem repeats has been genotyped using STRdust and LongTR to determine the tandem repeat length and sequence composition. PathSTR provides rich visualizations of this data set and the feature to upload one's data for comparison along the control cohort. We demonstrate the implementation of this application using data from targeted nanopore sequencing of a patient with myotonic dystrophy type 1. This resource will empower the genetics community to get a more complete overview of normal variation in tandem repeat length and sequence composition and, as such, enable a better assessment of rare tandem repeat alleles observed in patients.

Citations: 0