Tuan V. Nguyen, Sunduimijid Bolormaa, Coralie M. Reich, Amanda J. Chamberlain, Christy J. Vander Jagt, Hans D. Daetwyler, Iona M. MacLeod
Genetics Selection Evolution. Published 2024-11-15. DOI: 10.1186/s12711-024-00942-2 (https://doi.org/10.1186/s12711-024-00942-2)
Empirical versus estimated accuracy of imputation: optimising filtering thresholds for sequence imputation
Genotype imputation is a cost-effective method for obtaining sequence genotypes for downstream analyses such as genome-wide association studies (GWAS). However, low imputation accuracy can increase the risk of false positives, so it is important to pre-filter data or at least assess the potential limitations due to imputation accuracy. In this study, we benchmarked three different imputation programs (Beagle 5.2, Minimac4 and IMPUTE5) and compared the empirical accuracy of imputation with the software-estimated accuracy of imputation (Rsqsoft). We also tested the accuracy of imputation in cattle for autosomal and X chromosomes, SNP and INDEL, when imputing from either low-density or high-density genotypes. The accuracy of imputing sequence variants from real high-density genotypes was higher than from low-density genotypes. In our software benchmark, all programs performed well with only minor differences in accuracy. While there was a close relationship between empirical imputation accuracy and the imputation Rsqsoft, this relationship differed considerably for Minimac4 compared to Beagle 5.2 and IMPUTE5. We found that the Rsqsoft threshold for removing poorly imputed variants must be customised according to the software, and this should be accounted for when merging data from multiple studies, such as in a meta-GWAS. We also found that imposing an Rsqsoft filter has a positive impact on genomic regions with poor imputation accuracy due to large segmental duplications that are susceptible to error-prone alignment. Overall, our results showed that on average the imputation accuracy for INDEL was approximately 6% lower than for SNP for all software programs. Importantly, the imputation accuracy for the non-PAR (non-Pseudo-Autosomal Region) of the X chromosome was comparable to autosomal imputation accuracy, while for the PAR it was substantially lower, particularly when starting from low-density genotypes. This study provides an empirically derived approach to apply customised software-specific Rsqsoft thresholds for downstream analyses of imputed variants, such as needed for a meta-GWAS. The very poor empirical imputation accuracy for variants on the PAR when starting from low-density genotypes demonstrates that this region should be imputed starting from a higher density of real genotypes.
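The comparison described above can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not confirm: empirical accuracy is taken here to be the squared Pearson correlation between true genotypes and imputed allele dosages for one variant, and the threshold values in `RSQSOFT_THRESHOLDS` are hypothetical placeholders, not the values derived in the study. The function names are also illustrative.

```python
def empirical_r2(true_geno, imputed_dosage):
    """Squared Pearson correlation between true genotypes (0/1/2) and
    imputed dosages for a single variant (assumed accuracy measure)."""
    n = len(true_geno)
    if n != len(imputed_dosage) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 animals")
    mean_t = sum(true_geno) / n
    mean_d = sum(imputed_dosage) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(true_geno, imputed_dosage))
    var_t = sum((t - mean_t) ** 2 for t in true_geno)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosage)
    if var_t == 0 or var_d == 0:
        return None  # correlation is undefined for a monomorphic variant
    return (cov * cov) / (var_t * var_d)


# Hypothetical, software-specific cut-offs (placeholders, NOT from the paper).
RSQSOFT_THRESHOLDS = {"beagle5.2": 0.8, "minimac4": 0.6, "impute5": 0.8}


def filter_by_rsqsoft(variants, software):
    """Keep the IDs of variants whose software-reported Rsqsoft meets the
    cut-off chosen for that program. `variants` is a list of (id, rsqsoft)."""
    threshold = RSQSOFT_THRESHOLDS[software]
    return [vid for vid, rsq in variants if rsq >= threshold]
```

In a setting like the one described, the empirical value would be computed on animals with both real sequence genotypes and imputed dosages, and the per-software cut-off would then be chosen so that variants passing the Rsqsoft filter reach a comparable empirical accuracy across programs before data sets are merged.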
About the journal:
Genetics Selection Evolution invites basic, applied and methodological content that will aid the current understanding and the utilization of genetic variability in domestic animal species. Although the focus is on domestic animal species, research on other species is invited if it contributes to the understanding of the use of genetic variability in domestic animals. Genetics Selection Evolution publishes results from all levels of study, from the gene to the quantitative trait, from the individual to the population, the breed or the species. Contributions concerning both the biological approach, from molecular genetics to quantitative genetics, as well as the mathematical approach, from population genetics to statistics, are welcome. Specific areas of interest include but are not limited to: gene and QTL identification, mapping and characterization, analysis of new phenotypes, high-throughput SNP data analysis, functional genomics, cytogenetics, genetic diversity of populations and breeds, genetic evaluation, applied and experimental selection, genomic selection, selection efficiency, and statistical methodology for the genetic analysis of phenotypes with quantitative and mixed inheritance.