A computational framework for improving genetic variants identification from 5,061 sheep sequencing data.

Journal of Animal Science and Biotechnology, 14(1):127 · IF 6.3 · Q1 (Agriculture, Dairy & Animal Science) · Pub Date: 2023-10-02 · DOI: 10.1186/s40104-023-00923-3 · Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10544426/pdf/

Shangqian Xie, Karissa Isaacs, Gabrielle Becker, Brenda M Murdoch


A computational framework for improving genetic variants identification from 5,061 sheep sequencing data.

Background: Pan-genomics is a recently emerging strategy that can provide a more comprehensive characterization of genetic variation. Joint calling is routinely used to combine variants identified across multiple related samples. However, for population-scale genotyping, the improvement in variant identification gained from the mutual support information of multiple samples remains quite limited.

Results: In this study, we developed a computational framework for joint calling of genetic variants from 5,061 sheep that incorporates sequencing error and optimizes the mutual support information across samples. Variants were identified in four steps: (1) The probabilities of variants called by two widely used algorithms, GATK and Freebayes, were calculated with a Poisson model that incorporates the base sequencing error potential; (2) Variants with high mapping quality, or consistently identified in at least two samples by GATK and Freebayes, were used to construct the raw high-confidence identification (rHID) variant database; (3) The high-confidence variants identified in a single sample were ordered by probability value and controlled by false discovery rate (FDR) using the rHID database; (4) To avoid eliminating potentially true variants from the rHID database, variants that failed FDR control were re-examined to rescue potentially true variants and ensure highly accurate variant identification. With the new method, the percentage of concordant SNPs and Indels between Freebayes and GATK improved significantly, by 12%-32%, compared with the raw variants. The method also found low-frequency variants in individual sheep that are involved in several traits, including nipple number (GPC5), scrapie pathology (PAPSS2), seasonal reproduction and litter size (GRM1), coat color (RAB27A), and lentivirus susceptibility (TMEM154).
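The abstract does not give the formulas behind the four steps, so the following is only a minimal Python sketch of the general idea, under stated assumptions. It assumes the per-variant probability is a Poisson tail probability that the observed alternate-allele reads arose from sequencing error alone (rate = depth × per-base error rate), it swaps in the standard Benjamini-Hochberg procedure for the paper's rHID-based FDR control, and it treats "rescue" as simple membership in an rHID set. The function names, the `calls` record layout, and the default `error_rate` and `alpha` values are all hypothetical and are not taken from the paper.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def error_p_value(alt_reads, depth, error_rate):
    """Assumed model: chance that sequencing error alone yields this many alt reads."""
    return poisson_sf(alt_reads, depth * error_rate)


def benjamini_hochberg(p_values, alpha=0.05):
    """Standard BH step-up procedure; returns one pass/fail flag per p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank  # largest rank that satisfies the BH bound
    passed = [False] * m
    for idx in order[:cutoff]:
        passed[idx] = True
    return passed


def filter_sample(calls, rhid, error_rate=0.01, alpha=0.05):
    """Keep calls that pass FDR control, plus failed calls found in the rHID set.

    The rescue rule (membership in `rhid`) is an illustrative assumption.
    """
    p_values = [error_p_value(c["alt_reads"], c["depth"], error_rate) for c in calls]
    passed = benjamini_hochberg(p_values, alpha)
    return [c["site"] for c, ok in zip(calls, passed) if ok or c["site"] in rhid]


if __name__ == "__main__":
    # Toy data: sites are (chrom, pos, ref, alt) tuples.
    calls = [
        {"site": ("chr1", 100, "A", "G"), "alt_reads": 12, "depth": 30},
        {"site": ("chr1", 200, "C", "T"), "alt_reads": 1, "depth": 30},
        {"site": ("chr2", 50, "G", "A"), "alt_reads": 1, "depth": 30},
    ]
    rhid = {("chr2", 50, "G", "A")}  # stand-in for the multi-sample rHID database
    print(filter_sample(calls, rhid))
```

In this toy input, the well-supported call passes FDR control on its own, one weakly supported call is dropped, and the other weakly supported call is kept only because the same site appears in the stand-in rHID set. That is the kind of cross-sample support that would let a low-frequency variant survive filtering.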

Conclusion: The new method uses a computational strategy to reduce the number of false positives while improving the identification of genetic variants. The strategy incurs no extra cost, because it requires no additional samples or sequencing data, and it identifies rare variants that can be important for practical applications in animal breeding.

Source journal

Journal of Animal Science and Biotechnology

CiteScore: 10.30

Self-citation rate: 0.00%

Articles published: 822