Genome annotations provide the essential framework for genomic analyses, capturing our current knowledge of gene structure and function as inferred from computational predictions and experimental evidence. Even as automated annotation pipelines become more sophisticated, their accuracy in representing unconventional gene expression events remains largely untested. Here, we address this gap by examining the most common form of translational recoding: the insertion of selenocysteine (Sec), a non-canonical amino acid incorporated into selenoproteins, oxidoreductase enzymes with essential roles in redox homeostasis. Sec insertion occurs in response to UGA, which is normally interpreted as a stop codon but is recoded in selenoprotein mRNAs. Owing to this dual function of UGA, the identification of selenoprotein genes poses a challenge. We show that vertebrate selenoprotein genes are widely misannotated in major public databases. Only 11% and 5% of selenoprotein genes are well annotated in Ensembl and NCBI GenBank, respectively, due to the lack of dedicated selenoprotein annotation pipelines. In most cases (81% and 84%, respectively), overlapping flawed annotations are present that lack the Sec-encoding UGA. In contrast, NCBI RefSeq employs a dedicated selenoprotein pipeline, yet with some shortcomings: its selenoprotein annotations are correct in 77% of cases, and most errors affect families with a C-terminal Sec residue. We argue that selenoproteins must be correctly annotated in public databases and that this must occur via automated pipelines in order to keep pace with genome sequencing. To facilitate this task, we present a new version of Selenoprofiles, a homology-based tool for selenoprotein prediction that produces predictions with accuracy comparable to manual curation and can be easily deployed and integrated into existing annotation pipelines.
