The question of protein homology versus analogy arises when proteins share a common function or a common structural fold without any statistically significant amino acid sequence similarity. Proteins that lack similar sequences but share a common fold and the same or a closely related function are assumed to be homologs, descended from a common ancestor. The problem of homolog identification is compounded for proteins of 100 or fewer amino acids, because the number of basic single-domain folds is limited and sequence similarity is likely to be identified by chance. The latter arises from two conditions: first, any search of the currently very large protein database is likely to identify short regions of chance match; second, a direct sequence comparison among a small set of short proteins sharing a similar fold can detect many similar patterns of hydrophobicity even if the proteins do not descend from a common ancestor. In an effort to identify distant homologs of the many ubiquitin proteins, we have developed a combined structure and sequence similarity approach that attempts to overcome these limitations of homolog identification. This approach identifies 90 probable ubiquitin-related proteins, including examples from the two prokaryotic domains of life, Archaea and Bacteria.
Using definitions of protein folds encoded as text strings, a dynamic programming algorithm was devised to compare the folds, identify their largest common substructure, and calculate the distance (in terms of the number of edit operations) of that substructure from each structure. This provided a metric by which the folds were clustered into a 'phylogenetic' tree. This construction differs from previous automatic structure clustering algorithms in that it has an explicit representation of the structures at 'ancestral' branching nodes, even when these have no corresponding known structure. The resulting tree was compared with one compiled by an 'expert' in the field; while there was broad agreement, differences were found that resulted from differing degrees of emphasis being placed on the types of operations that can be used to transform structures. Some concluding speculations on the relationship of such trees to the evolutionary history and folding of the proteins are advanced.
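The string comparison described above can be illustrated with a minimal sketch. The abstract does not specify the fold encoding or the costs of the edit operations, so everything here is an assumption: folds are written as hypothetical strings of secondary-structure symbols ('E' for strand, 'H' for helix), every insertion, deletion, and substitution costs 1, and a longest common subsequence stands in for the 'largest common substructure'. The function names are illustrative and are not taken from the original work.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit (Levenshtein) distance between two fold strings."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def largest_common_subsequence(a: str, b: str) -> str:
    """Longest common subsequence, used here as a stand-in (assumption)
    for the largest common substructure of two fold strings."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Trace back through the table to recover one optimal subsequence.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def deletions_to_common(structure: str, common: str) -> int:
    """Number of elements removed from a structure to reach the common core."""
    return len(structure) - len(common)
```

Under these assumptions, the hypothetical fold strings "EHEE" and "EEHE" lie at an edit distance of 2 and share a common core of three elements, so each structure is one deletion away from that core. A matrix of such pairwise distances is the kind of metric on which a clustering into a tree could then be built; the clustering step itself is not sketched here.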
We previously reported a method, designated functional salvage screen (FSS), to generate protein lineages with new sequence spaces through the functional or structural salvage of a defective protein, using green fluorescent protein (GFP) as a model protein. Here, in an attempt to mimic a step in the natural evolution of proteins, the functionally salvaged mutant GFP-I5, which has a new sequence space but shows low fluorescence intensity and stability, was selected and fine-tuned by directed evolution. During the course of functional tuning, GFP-I5 was found to evolve rapidly, recovering the spectral traits of the parent GFPuv. The mutant 3E4 from the third round of directed evolution possessed four substitutions; three (F64L, E111V and K166Q) were in the original GFP gene and the other (K8N) was in the inserted segment. The fluorescence intensity of 3E4 was approximately 28-fold stronger than that of GFP-I5, and other spectral properties were retained. Biochemical and biophysical investigations suggested that fine-tuning by directed evolution led the salvaged variant GFP-I5 to a functionally favorable structure, resulting in recovery of stability and fluorescence. Site-directed mutagenesis of the mutated amino acid residues in both GFPuv and GFP-I5 revealed that each amino acid residue has a different effect on the fluorescence intensity, which implies that 3E4 adopted a new evolutionary path with respect to fluorescence characteristics compared with the parent GFPuv. Directed evolution in conjunction with FSS is expected to be useful for generating protein lineages with new fitness landscapes.
In this article, we introduce a rapid, protein sequence database-driven approach to characterize all contacting residue pairs present in protein hybrids for inconsistency with protein family structural features. This approach is based on examining contacting residue pairs with different parental origins for different types of potentially unfavorable interactions (i.e., electrostatic repulsion, steric hindrance, cavity formation and hydrogen bond disruption). The clashing residue pairs identified between members of a protein family are then contrasted against functionally characterized hybrid libraries. Comparisons for five different protein recombination studies available in the literature, namely (i) glycinamide ribonucleotide transformylase (GART) from Escherichia coli (purN) and human (hGART), (ii) human Mu class glutathione S-transferase (GST) M1-1 and M2-2, (iii) beta-lactamase TEM-1 and PSE-4, (iv) catechol-2,3-oxygenase xylE and nahH, and (v) dioxygenases (toluene dioxygenase, tetrachlorobenzene dioxygenase and biphenyl dioxygenase), reveal that the patterns of identified clashing residue pairs are remarkably consistent with the experimentally found patterns of functional crossover profiles. Specifically, we show that the proposed residue clash maps are on average 5.0 times more effective than randomly generated clashes and 1.6 times more effective than residue contact maps at explaining the observed crossover distributions among functional members of hybrid libraries. This suggests that residue clash maps can provide quantitative guidelines for the placement of crossovers in the design of protein recombination experiments.
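The idea of screening contacting residue pairs of different parental origin can be illustrated with a minimal sketch. It covers only one of the four clash types named above (electrostatic repulsion) and is not the authors' implementation. It assumes two aligned parental sequences of equal length, a hypothetical list of contacting positions given as 0-based pairs (i, j) with i < j, a simple charge table (K and R positive, D and E negative, histidine ignored), and hybrids formed by a single crossover. All names are illustrative.

```python
# Simplified side-chain charges (assumption: histidine treated as neutral).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def like_charged(x: str, y: str) -> bool:
    """True if both residues carry a charge of the same sign."""
    return CHARGE.get(x, 0) * CHARGE.get(y, 0) > 0


def electrostatic_clash_map(parent_a: str, parent_b: str, contacts):
    """List contacting pairs that would place like charges together when the
    two residues come from different parents.

    Each entry is (i, j, origin): "A-B" means residue i is taken from
    parent A and residue j from parent B; "B-A" is the reverse.
    """
    clashes = []
    for i, j in contacts:
        if like_charged(parent_a[i], parent_b[j]):
            clashes.append((i, j, "A-B"))
        if like_charged(parent_b[i], parent_a[j]):
            clashes.append((i, j, "B-A"))
    return clashes


def clashes_for_crossover(parent_a: str, parent_b: str, contacts, crossover: int) -> int:
    """Count clashes in a single-crossover hybrid in which positions before
    `crossover` come from parent A and the remaining positions from parent B."""
    count = 0
    for i, j in contacts:
        # Only contacts that span the crossover mix the two parents.
        if i < crossover <= j and like_charged(parent_a[i], parent_b[j]):
            count += 1
    return count


# Toy data (hypothetical): each parent has a salt bridge between
# positions 1 and 2, but with the charges swapped.
parent_a = "AKDA"
parent_b = "AEKA"
contacts = [(0, 3), (1, 2)]
clash_map = electrostatic_clash_map(parent_a, parent_b, contacts)
```

In this toy case both mixed-origin combinations at the contact (1, 2) pair like charges, so a crossover placed between positions 1 and 2 introduces a clash, whereas a crossover after position 0 does not. Counting such clashes for each candidate crossover position gives a crude, purely illustrative analogue of a clash map used to guide crossover placement.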