Finding haplotypic signatures in proteins

GigaScience (IF 11.8; Zone 2, Biology; Q1, Multidisciplinary Sciences). Pub Date: 2022-12-28. DOI: 10.1101/2022.11.21.517096

J. Vašíček, Dafni Skiadopoulou, K. Kuznetsova, Bo Wen, S. Johansson, P. Njølstad, Stefan Bruckner, L. Käll, Marc Vaudel


Citations: 0

Abstract

The non-random distribution of alleles of common genomic variants produces haplotypes, which are fundamental in medical and population genetic studies. Consequently, protein-coding genes with different co-occurring sets of alleles can encode different amino acid sequences: protein haplotypes. These protein haplotypes are present in biological samples and detectable by mass spectrometry, but they are not accounted for in proteomic searches. As a result, the impact of haplotypic variation on the results of proteomic searches and the discoverability of haplotype-specific peptides remain unknown. Here, we study how common genetic haplotypes influence the proteomic search space and investigate the possibility of matching peptides containing multiple amino acid substitutions to a publicly available data set of mass spectra. We found that for 9.96% of the discoverable amino acid substitutions encoded by common haplotypes, two or more substitutions may co-occur in the same peptide after tryptic digestion of the protein haplotypes. We identified 342 spectra that matched such multi-variant peptides, and of the 4,251 amino acid substitutions identified, 6.63% were covered by multi-variant peptides. However, evaluating the reliability of these matches remains challenging, suggesting that refined error rate estimation procedures are needed for such complex proteomic searches. As these become available and the ability to analyze protein haplotypes increases, we anticipate that proteomics will provide new information on the consequences of common variation across tissues and time.
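To illustrate how two substitutions can end up in the same tryptic peptide, the following is a minimal sketch and not the authors' pipeline. It assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, and a toy sequence with hypothetical substitution positions; all function names are illustrative.

```python
import re


def tryptic_peptides(sequence):
    """Return (start, end) spans of tryptic peptides, 0-based, end exclusive.

    Assumed rule: cleave after K or R unless followed by P; no missed cleavages.
    """
    spans = []
    start = 0
    for match in re.finditer(r"[KR](?!P)", sequence):
        spans.append((start, match.end()))
        start = match.end()
    if start < len(sequence):
        # Trailing peptide that does not end in K or R.
        spans.append((start, len(sequence)))
    return spans


def multi_variant_peptides(sequence, substitution_positions):
    """List peptides that cover two or more substituted residues.

    substitution_positions holds 0-based indices into the haplotype sequence.
    """
    hits = []
    for start, end in tryptic_peptides(sequence):
        covered = [p for p in substitution_positions if start <= p < end]
        if len(covered) >= 2:
            hits.append((sequence[start:end], covered))
    return hits


# Toy protein haplotype with three hypothetical substituted residues.
toy_haplotype = "MKAVLSTDRGGHPEK"
toy_substitutions = [3, 6, 11]
result = multi_variant_peptides(toy_haplotype, toy_substitutions)
```

In this toy case the digest yields MK, AVLSTDR and GGHPEK. Positions 3 and 6 both fall in AVLSTDR, so that peptide is multi-variant, while position 11 sits alone in GGHPEK.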




Source journal

GigaScience (Multidisciplinary Sciences)

CiteScore: 15.50
Self-citation rate: 1.10%
Articles published: 119
Review time: 1 week

About the journal: GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.