MCHelper automatically curates transposable element libraries across eukaryotic species

IF 5.5 · Zone 2 (Biology) · Q1 (Biochemistry & Molecular Biology) · Genome Research · Pub Date: 2024-12-09 · DOI: 10.1101/gr.278821.123

Simon Orozco-Arias, Pío Sierra, Richard Durbin, Josefa González



Abstract

The number of species with high-quality genome sequences continues to increase, in part due to the scaling up of multiple large-scale biodiversity sequencing projects. While the need to annotate genic sequences in these genomes is widely acknowledged, the parallel need to annotate transposable element (TE) sequences that have been shown to alter genome architecture, rewire gene regulatory networks, and contribute to the evolution of host traits is becoming ever more evident. However, accurate genome-wide annotation of TE sequences is still technically challenging. Several de novo TE identification tools are now available, but manual curation of the libraries produced by these tools is needed to generate high-quality genome annotations. Manual curation is time-consuming, and thus impractical for large-scale genomic studies, and lacks reproducibility. In this work, we present the Manual Curator Helper tool MCHelper, which automates the TE library curation process. By leveraging MCHelper's fully automated mode with the outputs from three de novo TE identification tools, RepeatModeler2, EDTA, and REPET, in the fruit fly, rice, hooded crow, zebrafish, maize, and human, we show a substantial improvement in the quality of the TE libraries and genome annotations. MCHelper libraries are less redundant, with up to 65% reduction in the number of consensus sequences, have up to 11.4% fewer false positive sequences, and up to ∼48% fewer “unclassified/unknown” TE consensus sequences. Genome-wide TE annotations are also improved, including larger unfragmented insertions. Moreover, MCHelper is an easy-to-install and easy-to-use tool.


Source journal

Genome Research (Biology: Biochemistry & Molecular Biology)

CiteScore: 12.40

Self-citation rate: 1.40%

Articles published: 140

Review time: 6 months

Journal description: Launched in 1995, Genome Research is an international, continuously published, peer-reviewed journal that focuses on research that provides novel insights into the genome biology of all organisms, including advances in genomic medicine. Among the topics considered by the journal are genome structure and function, comparative genomics, molecular evolution, genome-scale quantitative and population genetics, proteomics, epigenomics, and systems biology. The journal also features exciting gene discoveries and reports of cutting-edge computational biology and high-throughput methodologies. New data in these areas are published as research papers, or methods and resource reports that provide novel information on technologies or tools that will be of interest to a broad readership. Complete data sets are presented electronically on the journal's web site where appropriate. The journal also provides Reviews, Perspectives, and Insight/Outlook articles, which present commentary on the latest advances published both here and elsewhere, placing such progress in its broader biological context.