Noa Yaffa Kan-Lingwood, Liran Sagi, Shahar Mazie, Naama Shahar, Lilith Zecherle Bitton, Alan Templeton, Daniel Rubenstein, Amos Bouskila, Shirli Bar-David
A major challenge in analysing single-nucleotide polymorphism (SNP) genotype datasets is detecting and filtering errors that bias analyses and lead to misinterpretation of ecological and evolutionary processes. Here, we present a comprehensive method to estimate and minimise genotyping error rates (deviations from the 'true' genotype) in any SNP dataset using triplicates (three repeats of the same sample) in a four-step filtration pipeline. The approach involves: (1) SNP filtering by missing data; (2) SNP filtering by error rates; (3) sample filtering by missing data; and (4) detection of recaptured individuals using estimated SNP error rates. The modular pipeline is provided as an R script that allows customised adjustments. We demonstrate the applicability of the method using non-invasive sampling from the Asiatic wild ass (Equus hemionus) population in Israel. We genotyped 756 samples using 625 SNPs, of which 255 were triplicates of 85 samples. The average SNP error rate, calculated from the number of mismatching genotypes across triplicates, was 0.0034 before filtration and was reduced to 0.00174 following filtration. Evaluating genetic distance (GD) and relatedness (r) between triplicates before and after filtration (expected to be at the minimum and maximum respectively) showed a significant reduction in the average GD, from 58.1 to 25.3 (p = 0.0002), and a significant increase in relatedness, from r = 0.98 to r = 0.991 (p = 0.00587). We demonstrate how error rate estimation enhances recapture detection and improves genotype quality.
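As a rough illustration of how a per-SNP error rate might be derived from triplicates, the sketch below counts, for each SNP, the genotype calls that disagree with the majority call within each triplicate and divides by the number of non-missing calls. The majority-vote definition, the function name `snp_error_rates` and the data layout are assumptions made for illustration; the published R script may define and filter errors differently.

```python
from collections import Counter


def snp_error_rates(triplicates, missing=None):
    """Estimate a per-SNP error rate from triplicated samples.

    triplicates: list of triplicates; each triplicate is a list of three
    genotype vectors (one per repeat), all with one entry per SNP.
    Assumption of this sketch: a call counts as an error when it differs
    from the most common non-missing call at that SNP within its triplicate.
    """
    n_snps = len(triplicates[0][0])
    mismatches = [0] * n_snps
    calls = [0] * n_snps
    for repeats in triplicates:
        for snp in range(n_snps):
            observed = [rep[snp] for rep in repeats if rep[snp] != missing]
            if len(observed) < 2:
                continue  # fewer than two calls: nothing to compare
            _, majority_count = Counter(observed).most_common(1)[0]
            mismatches[snp] += len(observed) - majority_count
            calls[snp] += len(observed)
    # None marks SNPs with no comparable calls at all
    return [m / c if c else None for m, c in zip(mismatches, calls)]


# Two hypothetical triplicates genotyped at three SNPs
example = [
    [["AA", "AG", "GG"], ["AA", "AG", "GG"], ["AA", "GG", "GG"]],
    [["AG", "AA", None], ["AG", "AA", "GG"], ["AG", "AA", "GG"]],
]
rates = snp_error_rates(example)
print(rates)
```

In this toy input only the second SNP shows a disagreement (one of six calls), so a filter on per-SNP error rate would single it out while leaving the other two untouched.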
Genotyping Error Detection and Customised Filtration for SNP Datasets. Molecular Ecology Resources, e14033. Published 2024-10-22. https://doi.org/10.1111/1755-0998.14033
Yannis Schöneberg, Tracy Lynn Audisio, Alexander Ben Hamadou, Martin Forman, Jiří Král, Tereza Kořínková, Eva Líznarová, Christoph Mayer, Lenka Prokopcová, Henrik Krehenwinkel, Stefan Prost, Susan Kennedy
Spiders are a hyperdiverse taxon and among the most abundant predators in nearly all terrestrial habitats. Their success is often attributed to key developments in their evolution such as silk and venom production and major apomorphies such as a whole-genome duplication. Resolving deep relationships within the spider tree of life has been historically challenging, making it difficult to measure the relative importance of these novelties for spider evolution. Whole-genome data offer an essential resource in these efforts, but also for functional genomic studies. Here, we present de novo assemblies for three spider species: Ryuthela nishihirai (Liphistiidae), a representative of the ancient Mesothelae, the suborder that is sister to all other extant spiders; Uloborus plumipes (Uloboridae), a cribellate orbweaver whose phylogenetic placement is especially challenging; and Cheiracanthium punctorium (Cheiracanthiidae), which represents only the second family to be sequenced in the hyperdiverse Dionycha clade. These genomes fill critical gaps in the spider tree of life. Using these novel genomes along with 25 previously published ones, we examine the evolutionary history of spidroin gene and structural hox cluster diversity. Our assemblies provide critical genomic resources to facilitate deeper investigations into spider evolution. The near chromosome-level genome of the 'living fossil' R. nishihirai represents an especially important step forward, offering new insights into the origins of spider traits.
Three Novel Spider Genomes Unveil Spidroin Diversification and Hox Cluster Architecture: Ryuthela nishihirai (Liphistiidae), Uloborus plumipes (Uloboridae) and Cheiracanthium punctorium (Cheiracanthiidae). Molecular Ecology Resources, e14038. Published 2024-10-22. https://doi.org/10.1111/1755-0998.14038
Anders K Krabberød, Embla Stokke, Ella Thoen, Inger Skrede, Håvard Kauserud
Current rDNA reference sequence databases are tailored towards shorter DNA markers, such as parts of the 16S/18S marker or the internal transcribed spacer (ITS) region. However, due to advances in long-read DNA sequencing technologies, longer stretches of the rDNA operon are increasingly used in environmental sequencing studies to increase the phylogenetic resolution. There is, therefore, a growing need for longer rDNA reference sequences. Here, we present the ribosomal operon database (ROD), which includes eukaryotic full-length rDNA operons retrieved from publicly available genome assemblies. Full-length operons were detected in 34.1% of the 34,701 examined eukaryotic genome assemblies from NCBI. In most cases (53.1%), more than one operon variant was detected, which can be due to intragenomic operon copy variability, allelic variation in non-haploid genomes, or technical errors from the sequencing and assembly process. The highest copy number found was 5947 in Zea mays. In total, 453,697 unique operons were detected, with 69,480 operon variant clusters remaining after intragenomic clustering at 99% sequence identity. The operon length varied extensively across eukaryotes, ranging from 4136 to 16,463 bp, which will lead to considerable polymerase chain reaction (PCR) bias during amplification of the entire operon. Clustering the full-length operons revealed that the different parts (i.e., 18S, 28S, and the hypervariable regions V4 and V9 of 18S) provide divergent taxonomic resolution, with 18S and its V4 and V9 regions being the most conserved. The ROD will be updated regularly to provide an increasing number of full-length rDNA operons to the scientific community.
The Ribosomal Operon Database: A Full-Length rDNA Operon Database Derived From Genome Assemblies. Molecular Ecology Resources, e14031. Published 2024-10-21. https://doi.org/10.1111/1755-0998.14031
One essential initial step in the analysis of ancient DNA is to authenticate that the DNA sequencing reads are actually from ancient DNA. This is done by assessing whether the reads exhibit typical characteristics of post-mortem damage (PMD), including cytosine deamination and nicks. We present a novel statistical method implemented in a fast multithreaded programme, ngsBriggs, that enables rapid quantification of PMD by estimation of the Briggs ancient damage model parameters (Briggs parameters). Using a multinomial model with maximum likelihood fit, ngsBriggs accurately estimates the parameters of the Briggs model, quantifying the PMD signal from single and double-stranded DNA regions. We extend the original Briggs model to capture PMD signals for contemporary sequencing platforms and show that ngsBriggs accurately estimates the Briggs parameters across a variety of contamination levels. Classification of reads into ancient or modern reads, for the purpose of decontamination, is significantly more accurate using ngsBriggs than using other methods available. Furthermore, ngsBriggs is substantially faster than other state-of-the-art methods. ngsBriggs offers a practical and accurate method for researchers seeking to authenticate ancient DNA and improve the quality of their data.
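The damage signal that such methods quantify can be pictured with a much simpler summary: the fraction of reference cytosines that are read as thymine at each distance from the 5′ end of a read. The sketch below tabulates that frequency from gap-free read/reference pairs. It is not the ngsBriggs algorithm or its interface, and it does not fit the Briggs parameters; the function name `ct_frequency_by_position` and the input format are hypothetical.

```python
def ct_frequency_by_position(pairs, max_pos=5):
    """C->T mismatch frequency at each of the first max_pos read positions.

    pairs: iterable of (read, reference) strings of equal length that are
    already aligned without gaps (an assumption of this sketch).
    Entry i of the result is (#C->T at position i) / (#reference C at
    position i), or None if no reference C was observed at that position.
    """
    c_total = [0] * max_pos
    c_to_t = [0] * max_pos
    for read, ref in pairs:
        for i, (read_base, ref_base) in enumerate(zip(read[:max_pos], ref[:max_pos])):
            if ref_base == "C":
                c_total[i] += 1
                if read_base == "T":
                    c_to_t[i] += 1
    return [t / n if n else None for t, n in zip(c_to_t, c_total)]


# Hypothetical toy alignments: (read, reference)
pairs = [
    ("TACGT", "CACGT"),
    ("TTCGA", "CTCGA"),
    ("CACCA", "CACCA"),
    ("GGCAT", "GGCAT"),
]
freqs = ct_frequency_by_position(pairs, max_pos=4)
print(freqs)
```

An elevated C→T frequency at the first position that falls away further into the read is the kind of pattern associated with deamination in single-stranded overhangs; a model-based estimator replaces this raw tabulation with fitted parameters.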
Lei Zhao, Rasmus Amund Henriksen, Abigail Ramsøe, Rasmus Nielsen, Thorfinn Sand Korneliussen. Revisiting the Briggs Ancient DNA Damage Model: A Fast Maximum Likelihood Method to Estimate Post-Mortem Damage. Molecular Ecology Resources, e14029. Published 2024-10-21. https://doi.org/10.1111/1755-0998.14029
Lucas André Blattner, Pierre Lapellegerie, Colin Courtney-Mustaphi, Oliver Heiri
Chironomidae, so-called non-biting midges, are considered key bioindicators of aquatic ecosystem variability. Data derived from morphologically identifying their chitinous remains in sediments document chironomid larvae assemblages, which are studied to reconstruct ecosystem changes over time. Recent developments in sedimentary DNA (sedDNA) research have demonstrated that molecular techniques are suitable for determining past and present occurrences of organisms. Nevertheless, sedDNA records documenting alterations in chironomid assemblages remain largely unexplored. To close this gap, we examined the applicability of sedDNA metabarcoding to identify Chironomidae assemblages in lake sediments by sampling and processing three 21-35 cm long sediment cores from Lake Sempach in Switzerland. With a focus on developing analytical approaches, we compared an invertebrate-universal (FWH) and a newly designed Chironomidae-specific metabarcoding primer set (CH) to assess their performance in detecting Chironomidae DNA. We isolated and identified chitinous larval remains and compared the morphotype assemblages with the data derived from sedDNA metabarcoding. Results showed a good overall agreement of the morphotype assemblage-specific clustering among the chitinous remains and the metabarcoding datasets. Both methods indicated higher chironomid assemblage similarity between the two littoral cores in contrast to the deep lake core. Moreover, we observed a pronounced primer bias effect resulting in more Chironomidae detections with the CH primer combination compared to the FWH combination. Overall, we conclude that sedDNA metabarcoding can supplement traditional remain identifications and potentially provide independent reconstructions of past chironomid assemblage changes. Furthermore, it has the potential of more efficient workflows, better sample standardisation and species-level resolution datasets.
Sediment Core DNA-Metabarcoding and Chitinous Remain Identification: Integrating Complementary Methods to Characterise Chironomidae Biodiversity in Lake Sediment Archives. Molecular Ecology Resources, e14035. Published 2024-10-21. https://doi.org/10.1111/1755-0998.14035
Alexandrine Daniel, Paul Savary, Jean-Christophe Foltête, Gilles Vuidel, Bruno Faivre, Stéphane Garnier, Aurélie Khimoun
Modelling population connectivity is central to biodiversity conservation and often relies on resistance surfaces reflecting multi-generational gene flow. ResistanceGA (RGA) is a common optimization framework for parameterizing these surfaces by maximizing the fit between genetic distances and cost distances using maximum likelihood population effect models. As the reliability of this framework has rarely been studied, we investigated the conditions maximizing its accuracy for both prediction and interpretation of landscape features' permeability. We ran demo-genetic simulations in contrasted landscapes for species with distinct dispersal capacities and specialization levels, using corresponding reference cost scenarios. We then optimized resistance surfaces from the simulated genetic distances using RGA. First, we evaluated whether RGA identified the drivers of the genetic patterns, that is, distinguished Isolation-by-Resistance (IBR) patterns from either Isolation-by-Distance or patterns unrelated to ecological distances. We then assessed RGA predictive performance using a cross-validation method, and its ability to recover the reference cost scenarios shaping genetic structure in simulations. IBR patterns were well detected and genetic distances were predicted with great accuracy. This performance depended on the strength of the genetic structuring, sampling design and landscape structure. Matching the scale of the genetic pattern by focusing on population pairs connected through gene flow and limiting overfitting through cross-validation further enhanced inference reliability. Yet, the optimized cost values often departed from the reference values, making their interpretation and extrapolation potentially dubious. While demonstrating the value of RGA for predictive modelling, we call for caution and provide additional guidance for its optimal use.
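The cost distances that are fitted to genetic distances are least-cost path lengths across a resistance surface. A minimal sketch of that computation on a small grid follows, using Dijkstra's algorithm with 4-neighbour moves and a step cost equal to the mean resistance of the two cells involved. This is a generic illustration and not ResistanceGA's implementation; the grid values, the movement rule and the function name `cost_distance` are assumptions.

```python
import heapq


def cost_distance(resistance, start, goal):
    """Least-cost distance between two cells of a resistance grid.

    resistance: list of rows of positive numbers (one value per cell).
    start, goal: (row, col) tuples.
    Assumptions of this sketch: 4-neighbour movement, and each step costs
    the mean resistance of the cell left and the cell entered.
    """
    rows, cols = len(resistance), len(resistance[0])
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        dist, (r, c) = heapq.heappop(queue)
        if (r, c) == goal:
            return dist
        if dist > best[(r, c)]:
            continue  # stale queue entry, a cheaper route was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                step = (resistance[r][c] + resistance[nr][nc]) / 2
                new_dist = dist + step
                if new_dist < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_dist
                    heapq.heappush(queue, (new_dist, (nr, nc)))
    return float("inf")  # goal not reachable


# Hypothetical surface: a high-resistance strip (10) with a gap in the bottom row
surface = [
    [1, 10, 1],
    [1, 10, 1],
    [1, 1, 1],
]
d = cost_distance(surface, (0, 0), (0, 2))
print(d)
```

Here the cheapest route detours around the high-resistance strip, so the cost distance is smaller than the cost of crossing it directly even though the path is longer. An optimisation framework would adjust the resistance values assigned to landscape classes and recompute such distances until they best match the observed genetic distances.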
What can optimized cost distances based on genetic distances offer? A simulation study on the use and misuse of ResistanceGA. Molecular Ecology Resources, e14024. Published 2024-10-17. https://doi.org/10.1111/1755-0998.14024
J Antonio Baeza, Jeremiah J Minish, Todd P Michael
Complete mitochondrial genomes have become markers of choice to explore phylogenetic relationships at multiple taxonomic levels and they are often assembled using whole genome short-read sequencing. Herein, using three species of sea chubs as an example, we explored the accuracy of mitochondrial chromosomes assembled using Oxford Nanopore Technology (ONT) Kit 14 R10.4.1 long reads at different sequencing depths (high, low and very low or genome skimming) by comparing them to 'gold' standard reference mitochondrial genomes assembled using Illumina NovaSeq short reads. In two species of sea chubs, Girella nigricans and Kyphosus azureus, ONT long-read assembled mitochondrial genomes at high sequencing depths (> 25× whole [nuclear] genome) were identical to their respective short-read assembled mitochondrial genomes. Not a single 'homopolymer insertion', 'homopolymer deletion', 'simple substitution', 'single insertion', 'short insertion', 'single deletion' or 'short deletion' was detected in the long-read assembled mitochondrial genomes after aligning each of them to its short-read counterpart. In turn, in a third species, Medialuna californiensis, a 25× sequencing depth long-read assembled mitochondrial genome was 14 nucleotides longer than its short-read counterpart. The difference in total length between the latter two assemblies was due to the presence of a short motif 14 bp long that was repeated (twice) in the long-read but not in the short-read assembly. Read subsampling at a sequencing depth of 1× resulted in the assembly of partial or complete mitochondrial genomes with numerous errors, including, among others, simple indels, and indels at homopolymer regions. At 3× and 5× subsampling, genomes were identical (perfect) or almost identical (quasiperfect, 99.5% over 16,500 bp) to their respective Illumina assemblies. The newly assembled mitochondrial genomes exhibit identical gene composition and organisation compared with cofamilial species and a phylomitogenomic analysis based on translated protein-coding genes suggested that the family Kyphosidae is not monophyletic. The same analysis detected possible cases of misidentification of mitochondrial genomes deposited in GenBank. This study demonstrates that perfect (complete and fully accurate) or quasiperfect (complete but with a single or a very few errors) mitochondrial genomes can be assembled at high (> 25×) and low (3-5×) but not very low (1×, genome skimming) sequencing depths using ONT long reads and the latest ONT chemistries (Kit 14 and R10.4.1 flowcells with SUP basecalling). The newly assembled and annotated mitochondrial genomes can be used as a reference in environmental DNA studies focusing on bioprospecting and biomonitoring of these and other coastal species experiencing environmental insult. Given the small size of the sequencing device and low cost, we argue that ONT technology has the potential to improve access to high-throughput sequencing technologies in low- and middle-income countries.
{"title":"Assembly of Mitochondrial Genomes Using Nanopore Long-Read Technology in Three Sea Chubs (Teleostei: Kyphosidae).","authors":"J Antonio Baeza, Jeremiah J Minish, Todd P Michael","doi":"10.1111/1755-0998.14034","DOIUrl":"https://doi.org/10.1111/1755-0998.14034","PeriodicalId":211,"journal":{"name":"Molecular Ecology Resources","pages":"e14034"},"PeriodicalIF":5.5,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142454250","RegionNum":1,"RegionCategory":"Biology","Total":0}
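The comparison described in this abstract, aligning a long-read assembly to its short-read counterpart and tallying the differences, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `tally_differences` and `percent_identity` and the toy sequences are hypothetical, the two sequences are assumed to be already aligned (equal length, `-` for gaps), and the sketch does not separate homopolymer indels from other indels.

```python
from collections import Counter


def tally_differences(ref_aln, qry_aln):
    """Count matches, substitutions, insertions and deletions per column.

    Both arguments are rows of one pairwise alignment: equal length,
    with '-' marking a gap. 'insertion' means the query (long-read
    assembly) has a base where the reference (short-read assembly)
    has a gap; 'deletion' is the reverse.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for r, q in zip(ref_aln.upper(), qry_aln.upper()):
        if r == "-" and q == "-":
            continue  # column is empty in both rows; nothing to compare
        if r == "-":
            counts["insertion"] += 1
        elif q == "-":
            counts["deletion"] += 1
        elif r == q:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    return counts


def percent_identity(counts):
    """Matches as a percentage of all compared alignment columns."""
    compared = sum(counts.values())
    return 100.0 * counts["match"] / compared if compared else 0.0


# Hypothetical toy alignment: one deletion inside a T run and a
# two-base insertion near the end of the query.
ref = "ACGTTTTACG--A"
qry = "ACGTTT-ACGGTA"
counts = tally_differences(ref, qry)
print(dict(counts), round(percent_identity(counts), 2))
```

An assembly is "perfect" in the abstract's sense when only the `match` key is populated; any other key marks a difference that would need inspection.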
Paul D N Hebert, Robin Floyd, Saeideh Jafarpour, Sean W J Prosser
It is a global priority to better manage the biosphere, but action must be informed by comprehensive data on the abundance and distribution of species. The acquisition of such information is currently constrained by high costs. DNA barcoding can speed the registration of unknown species of animals, the most diverse kingdom of eukaryotes, because the BIN system automates their recognition. However, inexpensive sequencing protocols are critical, as a census of all animal species is likely to require the analysis of a billion or more specimens. Barcoding involves DNA extraction followed by PCR and sequencing, with the last step dominating costs until 2017. By enabling the sequencing of highly multiplexed samples, the Sequel platforms from Pacific BioSciences slashed costs by 90%, but these instruments are deployed only in core facilities because of their expense. Sequencers from Oxford Nanopore Technologies provide an escape from high capital and service costs, but their low sequence fidelity has, until recently, constrained adoption. The improved performance of the latest flow cells (R10.4.1) removes this barrier. This study demonstrates that a MinION flow cell can characterise an amplicon pool derived from 100,000 specimens, while a Flongle flow cell can process one derived from several thousand. At $0.01 per specimen, DNA sequencing is now the least expensive step in the barcode workflow.
{"title":"Barcode 100K Specimens: In a Single Nanopore Run.","authors":"Paul D N Hebert, Robin Floyd, Saeideh Jafarpour, Sean W J Prosser","doi":"10.1111/1755-0998.14028","DOIUrl":"https://doi.org/10.1111/1755-0998.14028","PeriodicalId":211,"journal":{"name":"Molecular Ecology Resources","pages":"e14028"},"PeriodicalIF":5.5,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142454251","RegionNum":1,"RegionCategory":"Biology","Total":0}
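To put the figures quoted in this abstract in proportion, the arithmetic can be worked through as below. This is only a back-of-the-envelope sketch: it assumes the quoted $0.01 sequencing cost scales linearly with specimen number and that every MinION run is filled with 100,000 specimens, neither of which the abstract states. The function names are hypothetical.

```python
from decimal import Decimal

COST_PER_SPECIMEN_USD = Decimal("0.01")  # sequencing cost quoted in the abstract
SPECIMENS_PER_MINION_RUN = 100_000       # amplicon pool size quoted in the abstract
CENSUS_SPECIMENS = 1_000_000_000         # "a billion or more": lower bound


def sequencing_cost(n_specimens):
    """Sequencing cost in USD, assuming linear scaling (an assumption)."""
    return COST_PER_SPECIMEN_USD * n_specimens


def runs_needed(n_specimens, per_run=SPECIMENS_PER_MINION_RUN):
    """Number of full-capacity runs, rounded up."""
    return -(-n_specimens // per_run)  # ceiling division on integers


cost_per_run = sequencing_cost(SPECIMENS_PER_MINION_RUN)
census_cost = sequencing_cost(CENSUS_SPECIMENS)
census_runs = runs_needed(CENSUS_SPECIMENS)

print(f"Sequencing cost per 100K-specimen run: ${cost_per_run:,.2f}")
print(f"Sequencing cost for one billion specimens: ${census_cost:,.2f}")
print(f"MinION runs needed for one billion specimens: {census_runs:,}")
```

Under these assumptions a billion-specimen census would need ten thousand full MinION runs; DNA extraction and PCR costs are not included.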
Jessen Havill, Olivia Strasburg, Tessy Udoh, Jacob E Crawford, Andrea Gloria-Soria
Eukaryotic genomes harbour sequences derived from non-retroviral RNA viruses, known as endogenous viral elements (EVEs) or non-retroviral integrated RNA virus sequences (NIRVS). These sequences represent a record of past infections and have been implicated in the host antiviral response. We have created a program to identify viral sequences integrated into a host genome. It takes a specimen BAM file as input and outputs candidate NIRVS, along with putative host insertion sites and overlapping features of the host genome, in XML and visual formats, with minimal intervention at intermediate steps. We used this software to analyse short-read data from the genomes of 222 wild-caught Aedes aegypti mosquitoes from a dozen geographical regions and located putative NIRVS from seven virus families. The program is as accurate as currently available software for NIRVS detection and represents a significant improvement in adaptability and user-friendliness. Furthermore, the flexibility of the pipeline allows the user to search for sequence integrations across the genome of any organism, as long as a query sequence database and a reference genome are provided. Potential extended applications include the identification of integrated transgenic sequences used for research or vector control strategies.
{"title":"EVE-X: Software to Identify Novel Viral Insertions in Wild-Caught Arthropod Hosts From Next-Generation Short Read Data.","authors":"Jessen Havill, Olivia Strasburg, Tessy Udoh, Jacob E Crawford, Andrea Gloria-Soria","doi":"10.1111/1755-0998.14026","DOIUrl":"https://doi.org/10.1111/1755-0998.14026","PeriodicalId":211,"journal":{"name":"Molecular Ecology Resources","pages":"e14026"},"PeriodicalIF":5.5,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142386792","RegionNum":1,"RegionCategory":"Biology","Total":0}
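One generic way to turn short-read alignments into putative host insertion sites is to cluster read pairs in which one mate maps to the host and the other to a viral sequence. The sketch below illustrates that idea only; it is not EVE-X's algorithm, which the abstract does not describe. The `ChimericPair` type, the `candidate_sites` function, the window and support thresholds, and the toy records are all hypothetical, and the pairs are assumed to have been extracted from a BAM file beforehand.

```python
from collections import defaultdict
from typing import NamedTuple


class ChimericPair(NamedTuple):
    """A read pair with one mate on the host and one on a virus (hypothetical)."""
    host_contig: str
    host_pos: int
    virus: str


def candidate_sites(pairs, window=500, min_support=2):
    """Cluster chimeric pairs into candidate insertion sites.

    Pairs are grouped by (host contig, virus). Within a group, sorted host
    positions are chained into one cluster while each position lies within
    `window` bp of the previous one. Clusters with at least `min_support`
    pairs are reported as (contig, start, end, virus, n_pairs).
    """
    grouped = defaultdict(list)
    for pair in pairs:
        grouped[(pair.host_contig, pair.virus)].append(pair.host_pos)

    sites = []
    for (contig, virus), positions in sorted(grouped.items()):
        positions.sort()
        cluster = [positions[0]]
        for pos in positions[1:]:
            if pos - cluster[-1] <= window:
                cluster.append(pos)
                continue
            if len(cluster) >= min_support:
                sites.append((contig, cluster[0], cluster[-1], virus, len(cluster)))
            cluster = [pos]  # start a new cluster at this position
        if len(cluster) >= min_support:
            sites.append((contig, cluster[0], cluster[-1], virus, len(cluster)))
    return sites


# Hypothetical toy input, not data from the study.
pairs = [
    ChimericPair("chr1", 1000, "Flaviviridae"),
    ChimericPair("chr1", 1120, "Flaviviridae"),
    ChimericPair("chr1", 1300, "Flaviviridae"),
    ChimericPair("chr1", 9000, "Flaviviridae"),  # isolated pair, below min_support
    ChimericPair("chr2", 500, "Rhabdoviridae"),
    ChimericPair("chr2", 650, "Rhabdoviridae"),
]
sites = candidate_sites(pairs)
for site in sites:
    print(site)
```

Requiring more than one supporting pair is a simple guard against isolated chimeric reads; a production tool would also need split-read evidence and mapping-quality filters.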