Fish ear bones, known as otoliths, are often collected in fisheries to assist in management, and are a common sample type in museum and national archives. Beyond their utility for ageing, morphological and trace element analysis, otoliths are a repository of valuable genomic information. Previous work has shown that DNA can be extracted from the trace quantities of tissue remaining on the surface of otoliths, despite the fact that they are often stored dry at room temperature. However, much of this work has used reduced representation sequencing methods in clean lab conditions, to achieve adequate yields of DNA, libraries and ultimately single-nucleotide polymorphisms (SNPs). Here, we pioneer the use of small-scale (spike-in) sequencing to screen contemporary otolith samples prepared in regular molecular biology (in contrast to clean) laboratories for contamination and quality levels, submitting for whole-genome resequencing only samples above a defined endogenous DNA threshold. Despite the typically low quality and quantity of DNA extracted from otoliths, we are able to produce whole-genome libraries and ultimately sets of filtered, unlinked and even putatively adaptive SNPs of ample numbers for downstream uses in population, climate and conservation genomics. By comparing with a set of tissue samples from the same species, we are able to highlight the quality and efficacy of otolith samples from DNA extraction and library preparation, to bioinformatic preprocessing and SNP calling. We provide detailed schematics, protocols and scripts of our approach, such that it can be adopted widely by the community, improving the use of otoliths as a source of valuable genomic data.
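The screening step described above can be sketched as a simple filter on spike-in results. This is a minimal illustration, not the authors' pipeline: the `SpikeInResult` record, the use of mapped-read fraction as the endogenous DNA estimate, and the 0.5 threshold are all assumptions made for the example, since the abstract does not state the threshold value.

```python
from dataclasses import dataclass


@dataclass
class SpikeInResult:
    """Hypothetical record of one sample's small-scale (spike-in) run."""
    sample_id: str
    total_reads: int    # reads sequenced in the spike-in run
    mapped_reads: int   # reads aligning to the target species' reference


def endogenous_fraction(result):
    """Estimate endogenous DNA content as mapped reads / total reads."""
    if result.total_reads == 0:
        return 0.0
    return result.mapped_reads / result.total_reads


def select_for_resequencing(results, threshold=0.5):
    """Return IDs of samples at or above the (assumed) endogenous threshold."""
    return [r.sample_id for r in results if endogenous_fraction(r) >= threshold]


# Toy values, invented for illustration only.
samples = [
    SpikeInResult("otolith_A", 10_000, 8_200),
    SpikeInResult("otolith_B", 12_000, 2_400),
    SpikeInResult("otolith_C", 0, 0),
]
selected = select_for_resequencing(samples, threshold=0.5)
```

Only samples passing the filter would then be submitted for whole-genome resequencing; in practice the threshold would be set from the observed distribution of endogenous content and the sequencing budget.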
Leveraging past allele frequencies has proven to be key for identifying the impact of natural selection across time. However, this approach suffers from imprecise estimations of the intensity (s) and timing (T) of selection, particularly when ancient samples are scarce in specific epochs. Here, we aimed to bypass the computation of allele frequencies across arbitrarily defined past epochs and refine the estimations of selection parameters by implementing convolutional neural network (CNN) algorithms that directly use ancient genotypes sampled across time. Using computer simulations, we first show that genotype-based CNNs consistently outperform an approximate Bayesian computation (ABC) approach based on past allele frequency trajectories, regardless of the selection model assumed and the number of available ancient genotypes. When applying this method to empirical data from modern and ancient Europeans, we replicated the reported increased number of selection events in post-Neolithic Europe, independently of the continental subregion studied. Furthermore, we substantially refined the ABC-based estimations of s and T for a set of positively and negatively selected variants, including iconic cases of positive selection and experimentally validated disease-risk variants. Our CNN predictions support a history of recent positive and negative selection targeting variants associated with host defence against pathogens, aligning with previous work that highlights the significant impact of infectious diseases, such as tuberculosis, in Europe. These findings collectively demonstrate that detecting the footprints of natural selection on ancient genomes is crucial for unravelling the history of severe human diseases.
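To make the roles of s and T concrete, the following sketch simulates the kind of data such methods work from: an allele frequency trajectory under drift with selection starting at a given generation, and single genotypes sampled at scattered time points. It is a toy haploid Wright-Fisher model under assumptions chosen for brevity (genic selection, constant population size, and an `onset` counted forward in generations); it is not the simulation framework used in the study.

```python
import random


def simulate_trajectory(n_e, p0, s, onset, generations, rng):
    """Allele frequency per generation under drift, with selection
    coefficient `s` acting from generation `onset` onwards."""
    freqs = [p0]
    p = p0
    for g in range(1, generations + 1):
        sel = s if g >= onset else 0.0
        # Deterministic frequency change due to genic selection.
        p_sel = p * (1.0 + sel) / (1.0 + sel * p)
        # Binomial sampling of n_e gene copies models genetic drift.
        copies = sum(1 for _ in range(n_e) if rng.random() < p_sel)
        p = copies / n_e
        freqs.append(p)
    return freqs


def sample_genotypes(freqs, sampling_times, rng):
    """Draw one diploid genotype (0, 1 or 2 copies of the allele)
    at each sampling time, mimicking sparse ancient samples."""
    return [
        (t, sum(rng.random() < freqs[t] for _ in range(2)))
        for t in sampling_times
    ]


rng = random.Random(42)
trajectory = simulate_trajectory(
    n_e=200, p0=0.1, s=0.05, onset=40, generations=100, rng=rng
)
genotypes = sample_genotypes(trajectory, [5, 30, 55, 80, 100], rng)
```

A genotype-based classifier would take inputs like `genotypes` directly, whereas a frequency-based approach would first pool such samples into epochs to estimate allele frequencies.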
Rapid environmental change poses unprecedented challenges to species persistence. To understand the impact that continued change could have, genomic offset methods have been used to forecast maladaptation of natural populations to future environmental change. However, while their use has become increasingly common, little is known regarding their predictive performance across a wide array of realistic and challenging scenarios. Here, we evaluate the performance of currently available offset methods (gradientForest, the Risk-Of-Non-Adaptedness, redundancy analysis with and without structure correction and LFMM2) using an extensive set of simulated data sets that vary demography, adaptive architecture and the number and spatial patterns of adaptive environments. For each data set, we train models using either all, adaptive or neutral marker sets and evaluate performance using in silico common gardens by correlating known fitness with projected offset. Using over 4,849,600 such evaluations, we find that (1) method performance is largely due to the degree of local adaptation across the metapopulation (LA), (2) adaptive marker sets provide minimal performance advantages, (3) performance within the species range is variable across gardens and declines when offset models are trained using additional non-adaptive environments and (4) despite (1), performance declines more rapidly in globally novel climates (i.e. a climate without an analogue within the species range) for metapopulations with greater LA than lesser LA. We discuss the implications of these results for management, assisted gene flow and assisted migration.
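The evaluation step, correlating known fitness with projected offset in a common garden, can be illustrated as follows. The abstract does not name the correlation statistic, so the Spearman rank correlation used here is an assumption, and the values are invented. A well-performing offset model should yield a strongly negative correlation, because larger offsets predict lower fitness.

```python
def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy common garden: one value per source population (invented numbers).
projected_offset = [0.1, 0.4, 0.2, 0.9, 0.6]
known_fitness = [0.95, 0.60, 0.80, 0.10, 0.35]
performance = spearman(projected_offset, known_fitness)
```

In this toy garden fitness decreases monotonically with offset, so the rank correlation sits at its minimum of -1.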
Environmental DNA (eDNA) preserved in marine sediments is increasingly being used to study past ecosystems. However, little is known about how accurately marine biodiversity is recorded in sediment eDNA archives, especially for planktonic taxa. Here, we address this question by comparing eukaryotic diversity in 273 eDNA samples from three water depths and the surface sediments of 24 stations in the Nordic Seas. Analysis of 18S-V9 metabarcoding data reveals distinct eukaryotic assemblages between water and sediment eDNA. Only 40% of Amplicon Sequence Variants (ASVs) detected in water were also found in sediment eDNA. Remarkably, the ASVs shared between water and sediment accounted for 80% of total sequence reads, suggesting that a large amount of plankton DNA is transported to the seafloor, predominantly from abundant phytoplankton taxa. However, not all plankton taxa were equally archived on the seafloor. The plankton DNA deposited in the sediments was dominated by diatoms and showed an underrepresentation of certain nano- and picoplankton taxa (Picozoa or Prymnesiophyceae). Our study offers the first insights into the patterns of plankton diversity recorded in sediment in relation to seasonality and spatial variability of environmental conditions in the Nordic Seas. Our results suggest that the genetic composition and structure of the plankton community vary considerably throughout the water column and differ from what accumulates in the sediment. Hence, the interpretation of sedimentary eDNA archives should take into account potential taxonomic and abundance biases when reconstructing past changes in marine biodiversity.
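The two summary figures above, the share of water ASVs also found in sediment and the share of reads those shared ASVs represent, differ because a few abundant ASVs can carry most of the reads. The sketch below shows one way to compute both from ASV count tables. The counts are invented toy values, not study data, and computing the read share over water reads only is an assumption, since the abstract does not specify the denominator.

```python
def shared_asv_summary(water_counts, sediment_counts):
    """Return (fraction of water ASVs also present in sediment,
    fraction of water reads belonging to those shared ASVs)."""
    water_asvs = {asv for asv, n in water_counts.items() if n > 0}
    sediment_asvs = {asv for asv, n in sediment_counts.items() if n > 0}
    shared = water_asvs & sediment_asvs

    total_water_reads = sum(water_counts[asv] for asv in water_asvs)
    shared_reads = sum(water_counts[asv] for asv in shared)
    return len(shared) / len(water_asvs), shared_reads / total_water_reads


# Toy read counts per ASV (invented for illustration).
water = {"ASV1": 500, "ASV2": 300, "ASV3": 120, "ASV4": 50, "ASV5": 30}
sediment = {"ASV1": 200, "ASV2": 90, "ASV6": 40}

frac_asvs, frac_reads = shared_asv_summary(water, sediment)
```

Here two of five water ASVs are shared, yet they account for most water reads, which is the pattern expected when abundant taxa dominate what reaches the sediment.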
The use of environmental DNA (eDNA) is becoming prevalent as a novel method of ecological monitoring. Although eDNA can provide critical information on the distribution and biomass of particular taxa, the DNA sequences of an organism remain unaltered throughout its existence, which complicates the accurate identification of crucial events, including spawning. Therefore, we examined DNA methylation as a novel source of information from eDNA, considering that the methylation patterns in eggs and sperm released during spawning differ from those of somatic tissues. Despite its potential applications, little is known about eDNA methylation, including its stability and methods for detection and quantification. To address this gap, we conducted tank experiments and performed methylation analysis targeting 18S rDNA through bisulphite amplicon sequencing. In the target region, eDNA methylation was not affected by degradation and was equivalent to the methylation rate of genomic DNA from somatic tissues. Unmethylated DNA, abundant in the ovaries, was detected in the eDNA released during fish spawning. These results indicate that eDNA methylation is a stable signal reflecting targeted gene methylation and further demonstrate that germ cell-specific methylation patterns can be used as markers for detecting fish spawning.
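Bisulphite treatment converts unmethylated cytosines to uracil, which is read as T after amplification, while methylated cytosines remain C. The methylation rate at a CpG site can therefore be estimated as C / (C + T) across reads. The sketch below assumes reads that are already aligned to the top strand of the reference without gaps. The sequences are invented, and this is not the analysis pipeline used in the study.

```python
def cpg_positions(reference):
    """Indices of cytosines that are followed by G in the reference."""
    return [i for i in range(len(reference) - 1) if reference[i:i + 2] == "CG"]


def methylation_rates(reference, reads):
    """Per-CpG methylation rate: reads showing C / reads showing C or T.

    Returns None for a site that no read covers with C or T.
    """
    rates = {}
    for pos in cpg_positions(reference):
        methylated = sum(1 for read in reads if read[pos] == "C")
        unmethylated = sum(1 for read in reads if read[pos] == "T")
        covered = methylated + unmethylated
        rates[pos] = methylated / covered if covered else None
    return rates


# Toy amplicon with CpG sites at indices 1 and 5 (invented sequences).
reference = "ACGTTCGA"
reads = ["ACGTTCGA", "ATGTTCGA", "ACGTTTGA", "ACGTTCGA"]
rates = methylation_rates(reference, reads)
```

A shift towards lower rates in eDNA relative to somatic tissue would be the kind of signal the abstract associates with unmethylated DNA released during spawning.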
Field-collected specimens were used to obtain nine high-quality genome assemblies from a total of 10 insect species native to prairies and savannas of central Illinois (USA): Mellilla xanthometata (Lepidoptera: Geometridae), Stenolophus ochropezus (Coleoptera: Carabidae), Forcipata loca (Hemiptera: Cicadellidae), Coelinius sp. (Hymenoptera: Braconidae), Thaumatomyia glabra (Diptera: Chloropidae), Brachynemurus abdominalus (Neuroptera: Myrmeleontidae), Catonia carolina (Hemiptera: Achilidae), Oncometopia orbona (Hemiptera: Cicadellidae), Flexamia atlantica (Hemiptera: Cicadellidae) and Stictocephala bisonia (Hemiptera: Membracidae). Sequencing library preparation from single specimens was successful despite extremely small DNA yields (<0.1 μg) for some samples. Additional sequencing and assembly workflows were adapted to each sample depending on the initial DNA yield. PacBio circular consensus (CCS/HiFi) or continuous long reads (CLR) libraries were used to sequence DNA fragments up to 50 kb in length, with Illumina sequenced linked-reads (TellSeq libraries) and Omni-C libraries used for scaffolding and gap-filling. Assembled genome sizes ranged from 135 Mb to 3.2 Gb. The number of assembled scaffolds ranged from 47 to >13,000, with the longest scaffold per assembly ranging from ~23 to 439 Mb. Genome completeness was high, with BUSCO scores ranging from 85.5% completeness for the largest genome (Stictocephala bisonia) to 98.8% completeness for the smallest genome (Coelinius sp.). Unique content, estimated using RepeatMasker and GenomeScope2, ranged from 50.7% to 75.8% and roughly decreased with increasing genome size. Structural annotation predicted a range of 19,281-72,469 protein models for sequenced species. Sequencing costs per genome at the time ranged from US$3,000 to US$5,000; for samples with PacBio HiFi data, each genome averaged ~1600 CPU-hours on a high-performance cluster and required approximately 14 h of bioinformatics analyses. Most assemblies would benefit from further manual curation to correct possible scaffold misjoins and translocations suggested by off-diagonal or depleted signals in Omni-C contact maps.
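The contiguity figures reported above (assembly size, scaffold count, longest scaffold) can be derived from a list of scaffold lengths, as in the sketch below. N50 is not reported in the abstract and is included only as a commonly used companion statistic. The lengths are invented.

```python
def assembly_summary(scaffold_lengths):
    """Summarise an assembly from its scaffold lengths (in bp)."""
    if not scaffold_lengths:
        raise ValueError("at least one scaffold length is required")
    lengths = sorted(scaffold_lengths, reverse=True)
    total = sum(lengths)

    # N50: length of the scaffold at which the running total, taken from
    # the longest scaffold downwards, first reaches half the assembly size.
    running = 0
    n50 = lengths[-1]
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    return {
        "n_scaffolds": len(lengths),
        "total_bp": total,
        "longest_bp": lengths[0],
        "n50_bp": n50,
    }


# Toy scaffold lengths (invented for illustration).
summary = assembly_summary([30, 10, 50, 20, 40])
```

In practice the lengths would be read from the assembly FASTA or its index file rather than typed in.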
A fundamental goal in population genetics is to understand how variation is arrayed over natural landscapes. From first principles we know that common features such as heterogeneous population densities and barriers to dispersal should shape genetic variation over space; however, there are few tools currently available that can deal with these ubiquitous complexities. Geographically referenced single nucleotide polymorphism (SNP) data are increasingly accessible, presenting an opportunity to study genetic variation across geographic space in myriad species. We present a new inference method that uses geo-referenced SNPs and a deep neural network to estimate spatially heterogeneous maps of population density and dispersal rate. Our neural network trains on simulated input and output pairings, where the input consists of genotypes and sampling locations generated from a continuous space population genetic simulator, and the output is a map of the true demographic parameters. We benchmark our tool against existing methods and discuss qualitative differences between the different approaches; in particular, our program is unique because it infers the magnitude of both dispersal and density as well as their variation over the landscape, and it does so using SNP data. Similar methods are constrained to estimating relative migration rates, or require identity-by-descent blocks as input. We applied our tool to empirical data from North American grey wolves, for which it estimated mostly reasonable demographic parameters, but was affected by incomplete spatial sampling. Genetic-based methods like ours complement other, direct methods for estimating past and present demography, and we believe they will serve as valuable tools for applications in conservation, ecology and evolutionary biology. An open source software package implementing our method is available from https://github.com/kr-colab/mapNN.
Efficient and accurate classification of DNA barcode data is crucial for large-scale fungal biodiversity studies. However, existing methods are either computationally expensive or lack accuracy. Previous research has demonstrated the potential of deep learning in this domain, successfully training neural networks for biological sequence classification. We introduce the MycoAI Python package, featuring various deep learning models such as BERT and CNN tailored for fungal Internal Transcribed Spacer (ITS) sequences. We explore different neural architecture designs and encoding methods to identify optimal models. By employing a multi-head output architecture and multi-level hierarchical label smoothing, MycoAI effectively generalizes across the taxonomic hierarchy. Using over 5 million labelled sequences from the UNITE database, we develop two models: MycoAI-BERT and MycoAI-CNN. While we emphasize the necessity of verifying classification results by AI models due to insufficient reference data, MycoAI still exhibits substantial potential. When benchmarked against existing classifiers such as DNABarcoder and RDP on two independent test sets with labels present in the training dataset, MycoAI models demonstrate high accuracy at the genus and higher taxonomic levels, with MycoAI-CNN being the fastest and most accurate. In terms of efficiency, MycoAI models can classify over 300,000 sequences within 5 min. We publicly release the MycoAI models, enabling mycologists to classify their ITS barcode data efficiently. Additionally, MycoAI serves as a platform for developing further deep learning-based classification methods. The source code for MycoAI is available under the MIT Licence at https://github.com/MycoAI/MycoAI.
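One way to read "hierarchical label smoothing" is that the probability mass removed from the true class is given to taxonomically close classes rather than spread uniformly over all classes. The sketch below illustrates that idea for a single taxonomic level. It is an interpretation made for illustration only: the function, its parameters and the smoothing scheme are assumptions and do not reproduce the MycoAI implementation or API.

```python
def hierarchical_smooth(target, parents, epsilon=0.1):
    """Build a soft target distribution over classes.

    target  : index of the true class
    parents : parents[i] is the parent taxon of class i
    epsilon : mass moved from the true class to its siblings
              (classes sharing the same parent taxon)
    """
    siblings = [
        i for i, parent in enumerate(parents)
        if parent == parents[target] and i != target
    ]
    distribution = [0.0] * len(parents)
    if not siblings:
        # No close relatives to share mass with: keep a one-hot target.
        distribution[target] = 1.0
        return distribution

    distribution[target] = 1.0 - epsilon
    for i in siblings:
        distribution[i] = epsilon / len(siblings)
    return distribution


# Four hypothetical species-level classes and the genus each belongs to.
genus_of_class = ["Amanita", "Amanita", "Amanita", "Boletus"]
soft_target = hierarchical_smooth(0, genus_of_class, epsilon=0.1)
```

Training against such soft targets penalises a prediction of the wrong species in the right genus less than a prediction in an unrelated genus, which is one plausible route to better generalisation across the taxonomic hierarchy.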