Identification of SNPs in genomes using GRAMEP, an alignment-free method based on the Principle of Maximum Entropy

Matheus Henrique Pimenta-Zanon, André Yoshiaki Kashiwabara, André Luís Laforga Vanzela, Fabricio Martins Lopes
arXiv - QuanBio - Genomics. Published 2024-05-02. DOI: arxiv-2405.01715

Abstract

Advances in high-throughput sequencing technologies provide a large number of genomes to be analyzed, so computational methodologies play a crucial role in analyzing and extracting knowledge from the data generated. Investigating genomic mutations is critical because of their impact on chromosomal evolution, genetic disorders, and diseases. Genomic variations are commonly analyzed by aligning sequences; however, this approach can be computationally expensive and potentially arbitrary on large datasets. Here, we present a novel method for identifying single nucleotide polymorphisms (SNPs) in DNA sequences from assembled genomes. This method uses the principle of maximum entropy to select the most informative k-mers specific to the variant under investigation. This informative k-mer set enables the detection of variant-specific mutations in comparison to a reference sequence. In addition, our method can classify novel sequences without organism-specific information. GRAMEP demonstrated high accuracy in both in silico simulations and analyses of real viral genomes, including Dengue, HIV, and SARS-CoV-2. Our approach maintained accurate SARS-CoV-2 variant identification at a lower computational cost than the gold-standard statistical tools. The source code for this proof-of-concept implementation is freely available at https://github.com/omatheuspimenta/GRAMEP.
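The idea of selecting informative, variant-specific k-mers can be illustrated with a minimal sketch. This is not the GRAMEP implementation: it assumes a Kapur-style maximum-entropy threshold applied to the frequency histogram of k-mers that are absent from the reference, and the function names (`count_kmers`, `max_entropy_threshold`, `informative_kmers`) and toy sequences are hypothetical.

```python
from collections import Counter
from math import log

DNA = set("ACGT")

def count_kmers(sequences, k):
    """Count every k-mer made only of A, C, G, T across the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= DNA:
                counts[kmer] += 1
    return counts

def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * log(p) for p in probs if p > 0)

def max_entropy_threshold(counts):
    """Assumed stand-in: pick the frequency cut that maximizes the summed
    entropy of the low- and high-frequency groups (Kapur-style)."""
    hist = Counter(counts.values())  # frequency -> number of k-mers
    values = sorted(hist)
    best_t, best_h = values[0], float("-inf")
    for t in values[:-1]:
        low = [hist[v] for v in values if v <= t]
        high = [hist[v] for v in values if v > t]
        h = (entropy([c / sum(low) for c in low])
             + entropy([c / sum(high) for c in high]))
        if h > best_h:
            best_t, best_h = t, h
    return best_t

def informative_kmers(variant_seqs, reference, k):
    """Keep k-mers absent from the reference whose frequency in the
    variant set exceeds the maximum-entropy threshold."""
    ref_kmers = set(count_kmers([reference], k))
    exclusive = {km: c for km, c in count_kmers(variant_seqs, k).items()
                 if km not in ref_kmers}
    if len(set(exclusive.values())) <= 1:
        return set(exclusive)  # nothing to separate
    t = max_entropy_threshold(exclusive)
    return {km for km, c in exclusive.items() if c > t}

# Toy data: the variant carries a C->T change; one copy has an extra error.
reference = "AAAACCCCGGGG"
variant = "AAAACTCCGGGG"
noisy = "AAAACTCCGGTG"
selected = informative_kmers([variant] * 3 + [noisy], reference, 3)
```

In this toy case the k-mers overlapping the shared C->T change occur in all four variant sequences, while those caused by the isolated error occur once, so the threshold separates the two groups without any alignment step.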