Potential implications of availability of short amino acid sequences in proteins: an old and new approach to protein decoding and design.

Joji M Otaki, Tomonori Gotoh, Haruhiko Yamamoto
Biotechnology Annual Review, 2008. DOI: 10.1016/S1387-2656(08)00004-5
Citation count: 11

Abstract

The three-dimensional structure of a protein molecule is primarily determined by its amino acid sequence, and thus the elucidation of general rules embedded in amino acid sequences is of great importance in protein science and engineering. To extract valuable information from sequences, we propose an analytical method in which a protein sequence is considered to be constructed by serial superimposition of short sequences of n amino acids (n-aa sets), especially triplets (3-aa sets). Using a comprehensive nonredundant protein database, we first examined the "availability" of all 8,000 possible triplet species. The availability score was mathematically defined as an indicator of the relative "preference" or "avoidance" of a given short constituent sequence in protein chains. Availability scores of real proteins were clearly biased relative to those of randomly generated proteins. We found many triplet species that occurred in the database more often or less often than expected. This bias extended to longer sets: some species of pentats (5-aa sets) that occurred reasonably frequently in the randomly generated protein population did not occur at all in any real protein known today. The availability score depended on the biological species, potentially serving as a phylogenetic indicator. Furthermore, we suggest various possible biotechnological applications of characteristic short sequences obtained from availability analysis, such as human-specific and pathogen-specific short sequences. The availability score also depended on secondary structure, potentially serving as a structural indicator. Availability analysis of triplets may be combined with a comprehensive data collection on the φ and ψ backbone dihedral angles of the amino acid at the center of each triplet, i.e., a collection of Ramachandran plots for each triplet. These triplet characteristics, together with other physicochemical data, will provide basic information on the relationship between protein sequence and structure, by which structure prediction and engineering may be greatly facilitated. Availability analysis may also be useful in identifying word-like processing units in amino acid sequences, based on an analogy to natural languages. Together with other approaches, availability analysis will elucidate general rules hidden in primary sequences and eventually contribute to rebuilding the paradigm of protein science.
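
The abstract does not state the formula for the availability score, so the following is only a minimal sketch of the general idea of counting overlapping triplets and comparing them with a random expectation. It assumes a log2 ratio of observed to expected counts, where the expectation comes from the single-residue composition of the same sequences; this stand-in score and the function names `count_kmers`, `residue_frequencies`, `log_ratio_scores`, and `absent_kmers` are hypothetical and are not the authors' definition.

```python
from collections import Counter
from itertools import product
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues; 20**3 = 8,000 triplets


def count_kmers(sequences, k=3):
    """Count every overlapping window of k residues across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def residue_frequencies(sequences):
    """Relative frequency of each single residue in the data set."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def log_ratio_scores(sequences, k=3):
    """Hypothetical stand-in score: log2(observed / expected) per k-mer.

    The expected count assumes residues are drawn independently according
    to their overall frequencies (an assumption, not the paper's model).
    Positive values suggest "preference", negative values "avoidance".
    """
    observed = count_kmers(sequences, k)
    freqs = residue_frequencies(sequences)
    total_windows = sum(observed.values())
    scores = {}
    for kmer, n in observed.items():
        expected = total_windows
        for aa in kmer:
            expected *= freqs[aa]
        scores[kmer] = log2(n / expected)
    return scores


def absent_kmers(observed, k=3):
    """List all possible k-mers that never occur in the observed counts."""
    return ["".join(p) for p in product(AMINO_ACIDS, repeat=k)
            if "".join(p) not in observed]


if __name__ == "__main__":
    toy = ["MKTAYIAKQR", "MKTLLVAAKQ"]  # toy sequences, not real database entries
    scores = log_ratio_scores(toy)
    for kmer in sorted(scores, key=scores.get, reverse=True)[:3]:
        print(kmer, round(scores[kmer], 2))
    print(len(absent_kmers(count_kmers(toy))), "of 8000 triplets never occur")
```

On a real nonredundant database the same counting would run over millions of windows, and the list returned by `absent_kmers` with `k=5` would correspond to the kind of never-observed pentats the abstract mentions. The independence-based expectation is the simplest possible null model; the randomly generated protein population used in the paper may have been constructed differently.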
