Principal Components of Genetic Sequences: Correlations and Significance

Mathematical Biology and Bioinformatics (Q3, Mathematics) · Pub Date: 2021-09-10 · DOI: 10.17537/2021.16.299

V. Efimov, K. V. Efimov, V. Kovaleva, Y. Matushkin


Citations: 0

Abstract

Any numerical series can be decomposed into principal components using singular spectrum analysis. We recently proposed a new analysis method, PCA-Seq, which computes numerical principal components for a sequence of elements of any type. In particular, the sequence may be composed of nucleotide base pairs or amino acid residues. Two questions inevitably arise: how to interpret the obtained principal components and how to assess their reliability. To interpret the principal components of a symbolic sequence, it is reasonable to evaluate their correlations with numerical characteristics of the sequence elements. When assessing the significance of correlations between sequences, one should bear in mind that standard significance criteria assume independent observations, an assumption that, as a rule, does not hold for real sequences. The article discusses the use for these purposes of the anchor bootstrap technique, which the authors also developed earlier. This approach assumes that the objects can be represented by points of a metric space and that, taken together, they form some fixed structure in it, in particular a sequence. The objects are assigned the same random integer weights as in the classical bootstrap. This is sufficient to obtain the bootstrap distribution of the correlation coefficients and assess their significance. The coding sequence of the SLC9A1 gene (synonyms APNH, NHE1, PPP1R143) was taken as an example of using the anchor bootstrap technique in genetic sequence analysis. The first principal component showed significant correlations with the hydrophobicity/“transmembraneity” of the corresponding fragments of the amino acid sequence, with their phenylalanine content, and with the difference between the T and A content of the corresponding nucleotide fragments. Other authors have previously found a similar pattern for other genes. Very likely, it is of a more general nature.
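The weighting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the weights enter the PCA-Seq computation, so the sketch only shows how classical-bootstrap integer weights (multinomial counts that sum to the number of objects) yield a bootstrap distribution of a weighted Pearson correlation coefficient. The function names `weighted_corr` and `bootstrap_corr`, the number of replicates, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def weighted_corr(x, y, w):
    """Pearson correlation of x and y with non-negative observation weights w."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    dx = x - np.average(x, weights=w)  # deviations from the weighted mean
    dy = y - np.average(y, weights=w)
    return np.sum(w * dx * dy) / np.sqrt(np.sum(w * dx**2) * np.sum(w * dy**2))


def bootstrap_corr(x, y, n_boot=2000, seed=0):
    """Bootstrap distribution of the correlation using random integer weights.

    Each replicate keeps every object in place and assigns it an integer
    weight; the weights are multinomial counts summing to n, i.e. the counts
    a classical resample-with-replacement would produce.
    """
    n = len(x)
    rng = np.random.default_rng(seed)
    weights = rng.multinomial(n, np.full(n, 1.0 / n), size=n_boot)  # (n_boot, n)
    # A replicate that puts all weight on one object would give NaN;
    # this is vanishingly unlikely unless n is very small.
    return np.array([weighted_corr(x, y, w) for w in weights])


if __name__ == "__main__":
    # Synthetic stand-ins for a principal component and a numerical
    # characteristic of the sequence elements (hypothetical data).
    rng = np.random.default_rng(1)
    pc1 = rng.normal(size=200)
    feature = 0.5 * pc1 + rng.normal(size=200)

    r = bootstrap_corr(pc1, feature)
    low, high = np.percentile(r, [2.5, 97.5])
    print(f"95% percentile interval for r: [{low:.3f}, {high:.3f}]")
```

Under these assumptions, a correlation might be called significant when the percentile interval excludes zero. Because the objects are reweighted rather than reordered, the positions in the sequence stay fixed in every replicate.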


Source journal

Mathematical Biology and Bioinformatics Mathematics-Applied Mathematics

CiteScore

1.10

Self-citation rate

0.00%

Number of publications