Public databases of protein sequences, such as the National Center for Biotechnology Information (NCBI) Protein repository and UniProt, contain millions of proteins identified in samples from specific species but named as uncharacterized or hypothetical due to a lack of information about their function. Many such sequences are actually derived from RNA viruses, due to viral infection of the original sample, contamination, or endogenous viral elements (EVEs) integrated into the genome of the sample species. Many proteins from RNA virus discovery research are also deposited into these repositories but are labelled as uncharacterized and only classified taxonomically at a superkingdom or realm level. Sequences from protein repositories not labelled specifically as being derived from the RNA-viral RNA-dependent RNA polymerase (RdRp) protein are often used as negative controls when looking to identify viral RdRp sequences, so the presence of unlabelled viruses amongst these datasets is problematic. These sequences also represent a source of information about novel viruses and EVEs. In this study, we screened uncharacterized proteins from two large public protein repositories, NCBI Protein and UniProt, to identify sequences likely to be derived from RNA viral RdRp and to perform detailed characterization of sequences of interest. We identified 3560 such sequences, many derived from EVEs. Many are previously unknown EVEs, which led to characterization of additional, related sequences. For example, a group of orbi-like viruses infecting nematodes was uncovered that appears to have both ancient endogenous and circulating exogenous members. Many integrations of mito-like viruses into plant genomes were also found. In several host taxonomic groups, the first example of an EVE, and in some cases the first example of any RNA virus, was uncovered.
The large number of EVEs uncovered by this relatively small-scale search suggests that only a fraction of the true diversity of EVEs is currently known. We also provide provisional taxonomic annotations for RdRps that are currently listed only as members of the Riboviria realm. A number of sequences are identified that are indistinguishable from viruses but are labelled as bacteria, seemingly as a result of mislabelling or contamination. Non-RdRp sequences that share near-significant similarity with RdRp are also characterized. Finally, recommendations are made for generating useful negative controls, and sets of negative control sequences are provided.
