MARS and RNAcmap3: The Master Database of All Possible RNA Sequences Integrated with RNAcmap for RNA Homology Search

Ke'ai Chen, Thomas Litfin, Jaswinder Singh, Jian Zhan, Yaoqi Zhou
Journal: Genomics, Proteomics & Bioinformatics. Published: 2024-03-01. DOI: 10.1093/gpbjnl/qzae018 (https://doi.org/10.1093/gpbjnl/qzae018)

Abstract

The recent success of AlphaFold2 in protein structure prediction relied heavily on co-evolutionary information derived from homologous protein sequences found in the huge, integrated database of protein sequences (Big Fantastic Database). In contrast, the existing nucleotide databases have not been consolidated to facilitate wider and deeper homology search. Here, we built a comprehensive database by including the non-coding RNA (ncRNA) sequences from RNAcentral, the transcriptome and metagenome assemblies from metagenomics RAST (MG-RAST), the genomic sequences from Genome Warehouse (GWH), and the genomic sequences from MGnify, in addition to the nucleotide database (nt) and its subsets from the National Center for Biotechnology Information (NCBI). The resulting Master database of All possible RNA sequences (MARS) is 20-fold larger than NCBI’s nt database and 60-fold larger than RNAcentral. The new dataset, along with a new split–search strategy, allows a substantial improvement in homology search over existing state-of-the-art techniques. It also yields more accurate and more sensitive multiple sequence alignments (MSAs) than manually curated MSAs from Rfam for the majority of structured RNAs mapped to Rfam. The results indicate that MARS, coupled with the fully automatic homology search tool RNAcmap, will be useful for improved structural and functional inference of ncRNAs and for RNA language models based on MSAs. MARS is accessible at https://ngdc.cncb.ac.cn/omix/release/OMIX003037 and RNAcmap3 is accessible at http://zhouyq-lab.szbl.ac.cn/download/.
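The abstract names a split-search strategy but does not describe how it works. One plausible reading, offered here only as an illustrative assumption and not as the RNAcmap3 implementation, is that a database too large to search in one pass is cut into shards, each shard is searched independently, and the per-shard hits are merged. The sketch below shows that general pattern in Python. All names (`read_fasta`, `split_records`, `split_search`, `toy_search`) are hypothetical, and `toy_search` is a stand-in scorer, not a real homology search tool.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]   # (sequence id, sequence)
Hit = Tuple[str, float]    # (sequence id, score; higher is better)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The id is the first whitespace-delimited token of the header.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_records(records: Iterable[Record], max_residues: int) -> Iterator[List[Record]]:
    """Group records into shards holding at most max_residues nucleotides.

    A single record longer than max_residues still gets its own shard.
    """
    shard: List[Record] = []
    size = 0
    for rec in records:
        if shard and size + len(rec[1]) > max_residues:
            yield shard
            shard, size = [], 0
        shard.append(rec)
        size += len(rec[1])
    if shard:
        yield shard


def split_search(
    query: str,
    records: Iterable[Record],
    search: Callable[[str, List[Record]], List[Hit]],
    max_residues: int,
) -> List[Hit]:
    """Search each shard separately and merge hits, keeping the best score per id."""
    best: Dict[str, float] = {}
    for shard in split_records(records, max_residues):
        for seq_id, score in search(query, shard):
            if score > best.get(seq_id, float("-inf")):
                best[seq_id] = score
    # Sort by descending score, then by id for a stable order.
    return sorted(best.items(), key=lambda hit: (-hit[1], hit[0]))


def toy_search(query: str, shard: List[Record]) -> List[Hit]:
    """Hypothetical stand-in scorer: ungapped identity over the shorter length."""
    hits: List[Hit] = []
    for seq_id, seq in shard:
        n = min(len(query), len(seq))
        if n == 0:
            continue
        score = sum(a == b for a, b in zip(query, seq)) / n
        if score >= 0.5:  # arbitrary illustrative cutoff
            hits.append((seq_id, score))
    return hits


if __name__ == "__main__":
    fasta = [">a", "ACGU", ">b", "ACGA", ">c", "UUUU"]
    print(split_search("ACGU", read_fasta(fasta), toy_search, max_residues=4))
```

In a real pipeline the `search` callable would wrap an actual homology search program run against each shard. Note that scores which depend on database size, such as E-values, would need recalibration before merging; this sketch ignores that issue.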