MCHelper automatically curates transposable element libraries across eukaryotic species

Genome Research | Pub Date: 2024-12-09 | DOI: 10.1101/gr.278821.123 | IF 6.2 | Zone 2 (Biology) | Q1 (Biochemistry & Molecular Biology)
Simon Orozco-Arias, Pío Sierra, Richard Durbin, Josefa González

Abstract

The number of species with high-quality genome sequences continues to increase, in part due to the scaling up of multiple large-scale biodiversity sequencing projects. While the need to annotate genic sequences in these genomes is widely acknowledged, the parallel need to annotate transposable element (TE) sequences, which have been shown to alter genome architecture, rewire gene regulatory networks, and contribute to the evolution of host traits, is becoming ever more evident. However, accurate genome-wide annotation of TE sequences is still technically challenging. Several de novo TE identification tools are now available, but manual curation of the libraries produced by these tools is needed to generate high-quality genome annotations. Manual curation is time-consuming, and thus impractical for large-scale genomic studies, and it lacks reproducibility. In this work, we present the Manual Curator Helper tool MCHelper, which automates the TE library curation process. By applying MCHelper's fully automated mode to the outputs of three de novo TE identification tools (RepeatModeler2, EDTA, and REPET) in the fruit fly, rice, hooded crow, zebrafish, maize, and human, we show a substantial improvement in the quality of the TE libraries and genome annotations. MCHelper libraries are less redundant, with up to a 65% reduction in the number of consensus sequences, and have up to 11.4% fewer false-positive sequences and up to ∼48% fewer “unclassified/unknown” TE consensus sequences. Genome-wide TE annotations are also improved, including larger unfragmented insertions. Moreover, MCHelper is an easy-to-install and easy-to-use tool.
Source journal
Genome Research (Biology: Biochemistry & Molecular Biology)
CiteScore: 12.40
Self-citation rate: 1.40%
Articles published: 140
Review time: 6 months
Journal description: Launched in 1995, Genome Research is an international, continuously published, peer-reviewed journal that focuses on research that provides novel insights into the genome biology of all organisms, including advances in genomic medicine. Among the topics considered by the journal are genome structure and function, comparative genomics, molecular evolution, genome-scale quantitative and population genetics, proteomics, epigenomics, and systems biology. The journal also features exciting gene discoveries and reports of cutting-edge computational biology and high-throughput methodologies. New data in these areas are published as research papers, or methods and resource reports that provide novel information on technologies or tools that will be of interest to a broad readership. Complete data sets are presented electronically on the journal's web site where appropriate. The journal also provides Reviews, Perspectives, and Insight/Outlook articles, which present commentary on the latest advances published both here and elsewhere, placing such progress in its broader biological context.