Missing genotype imputation in non-model species using self-organizing maps.

Molecular Ecology Resources (IF 5.5; Zone 1, Biology; JCR Q1, Biochemistry & Molecular Biology). Pub Date: 2024-07-06. Article e13992. DOI: 10.1111/1755-0998.13992
Fernando Mora-Márquez, Juan Carlos Nuño, Álvaro Soto, Unai López de Heredia

Abstract

Current methodologies of genome-wide single-nucleotide polymorphism (SNP) genotyping produce large amounts of missing data that may affect statistical inference and bias the outcome of experiments. Genotype imputation is routinely used in well-studied species to buffer the impact on downstream analysis, and several algorithms are available to fill in missing genotypes. However, the lack of reference haplotype panels precludes the use of these methods in genomic studies of non-model organisms. As an alternative, machine learning algorithms can be employed to explore the genotype data and to estimate the missing genotypes. Here, we propose an imputation method based on self-organizing maps (SOM), a widely used type of neural network formed by spatially distributed neurons that cluster similar inputs into nearby neurons. The method explores genotype datasets to select SNP loci, builds binary vectors from the genotypes at those loci, and initializes and trains neural networks for each queried missing SNP genotype. The SOM-derived clustering is then used to impute the best genotype. To automate the imputation process, we have implemented gtImputation, an open-source application written in Python 3 with a user-friendly GUI. The method's performance was validated by comparing its accuracy, precision and sensitivity with those of other available imputation algorithms on several benchmark genotype datasets. Our approach produced highly accurate and precise genotype imputations, even for SNPs with low-frequency alleles, and outperformed other algorithms, especially for datasets from mixed populations of unrelated individuals.
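
The pipeline described in the abstract (select loci, build binary vectors, train a SOM per missing genotype, impute from the resulting clusters) can be illustrated with a minimal sketch. This is not the gtImputation code. The genotype coding (0/1/2 alternate-allele counts, -1 for missing), the one-hot encoding, the locus-selection rule, the 3x3 map, the learning schedule and the majority vote are all assumptions made for illustration, and every function name is hypothetical.

```python
import numpy as np

# Assumed coding: genotypes are alternate-allele counts (0, 1, 2); -1 marks missing.
ONE_HOT = {0: (1, 0, 0), 1: (0, 1, 0), 2: (0, 0, 1)}

def encode(genotypes):
    """Concatenate the one-hot code of each genotype into one binary vector."""
    return np.array([bit for g in genotypes for bit in ONE_HOT[int(g)]], dtype=float)

def sq_dists(weights, x):
    """Squared Euclidean distance from vector x to every neuron."""
    return ((weights - x) ** 2).sum(axis=1)

def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM; returns weights of shape (rows*cols, dim)."""
    rng = np.random.default_rng(seed)
    # Initialise every neuron from a randomly chosen training vector.
    weights = data[rng.integers(0, len(data), rows * cols)].copy()
    # Grid position of each neuron, used by the neighbourhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                 # learning rate decays linearly
        sigma = sigma0 * (1 - frac) + 0.3     # neighbourhood radius shrinks
        for x in data[rng.permutation(len(data))]:
            bmu = np.argmin(sq_dists(weights, x))          # best-matching unit
            grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_d2 / (2 * sigma ** 2))        # Gaussian neighbourhood
            weights += lr * h[:, None] * (x - weights)
    return weights

def impute(matrix, sample, locus):
    """Impute matrix[sample, locus] with a SOM trained for this one query."""
    # Assumed selection rule: use every other locus genotyped in the query sample.
    predictors = [j for j in range(matrix.shape[1])
                  if j != locus and matrix[sample, j] >= 0]
    # Training samples need the target locus and all predictor loci genotyped.
    train_rows = [i for i in range(matrix.shape[0])
                  if i != sample and matrix[i, locus] >= 0
                  and all(matrix[i, j] >= 0 for j in predictors)]
    if not predictors or not train_rows:
        raise ValueError("not enough complete data to impute this genotype")
    train = np.array([encode(matrix[i, predictors]) for i in train_rows])
    labels = np.array([matrix[i, locus] for i in train_rows])
    weights = train_som(train)
    bmus = np.array([np.argmin(sq_dists(weights, v)) for v in train])
    query = encode(matrix[sample, predictors])
    # Walk neurons from nearest to farthest; vote in the first occupied one.
    for neuron in np.argsort(sq_dists(weights, query)):
        votes = labels[bmus == neuron]
        if votes.size:
            return int(np.bincount(votes).argmax())

# Toy dataset: two divergent groups of individuals, four loci, one missing call.
demo = np.array([[0, 0, 0, -1]] + [[0, 0, 0, 0]] * 4 + [[2, 2, 2, 2]] * 4)
print(impute(demo, sample=0, locus=3))
```

In this toy case the query individual shares its best-matching neuron with the other members of its own group, so the vote should return their genotype. Training a fresh map per missing genotype keeps the sketch close to the description above, at the cost of repeated training; a real implementation would need a more careful locus-selection step and handling of ties.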

Source journal
Molecular Ecology Resources (Biology - Evolutionary Biology)
CiteScore: 15.60
Self-citation rate: 5.20%
Articles published: 170
Review time: 3 months
Journal description: Molecular Ecology Resources promotes the creation of comprehensive resources for the scientific community, encompassing computer programs, statistical and molecular advancements, and a diverse array of molecular tools. Serving as a conduit for disseminating these resources, the journal targets a broad audience of researchers in the fields of evolution, ecology, and conservation. Articles in Molecular Ecology Resources are crafted to support investigations tackling significant questions within these disciplines. In addition to original resource articles, Molecular Ecology Resources features Reviews, Opinions, and Comments relevant to the field. The journal also periodically releases Special Issues focusing on resource development within specific areas.