A method for searching for latent periodicity is developed. We use mutual information to reveal the latent periodicity of mRNA sequences. Latent periodicity of an mRNA sequence is a periodicity with a low level of similarity between any two periods within the sequence. The mutual information between an artificial numerical sequence and an mRNA sequence is calculated, with the period length of the artificial sequence varied from 2 to 150. A high level of mutual information between the artificial and mRNA sequences allows us to find any type of latent periodicity in the mRNA sequence. Latent periodicity has been found in many mRNA coding regions. For example, the retinoblastoma gene of the HSRBS clone contains a region with a latent period of 45 bases, and the A-RAF oncogene of the HSARAFIR clone contains a region with a latent period of 84 bases. Integrated sequences for the regions with latent periodicity are determined. The potential significance of latent periodicity is discussed.
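The mutual information scan described in this abstract can be illustrated with a minimal sketch. It assumes that the artificial numerical sequence is simply the position index modulo the trial period (0, 1, ..., p-1, 0, 1, ...); the abstract does not spell out its construction, and the function names `mutual_information` and `scan_periods` are hypothetical. Raw mutual information tends to grow with the period length in finite samples, so a real analysis would need a significance correction that is not shown here.

```python
import math
from collections import Counter

def mutual_information(seq, period):
    """Mutual information (in bits) between the base at each position and
    the position index modulo `period` (the assumed artificial sequence)."""
    n = len(seq)
    joint = Counter((i % period, base) for i, base in enumerate(seq))
    phase = Counter(i % period for i in range(n))
    bases = Counter(seq)
    mi = 0.0
    for (k, b), count in joint.items():
        # p(k, b) / (p(k) * p(b)) simplifies to count * n / (phase[k] * bases[b])
        mi += (count / n) * math.log2(count * n / (phase[k] * bases[b]))
    return mi

def scan_periods(seq, min_period=2, max_period=150):
    """Mutual information for every trial period from min_period to max_period,
    capped at half the sequence length so each phase is seen at least twice."""
    upper = min(max_period, len(seq) // 2)
    return {p: mutual_information(seq, p) for p in range(min_period, upper + 1)}
```

For a perfectly periodic toy sequence such as `"ACG" * 30`, the scan gives a value near log2(3) bits at period 3 and near zero at period 2.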
Results: We have produced a computer program, named sim3, that solves the following computational problem. Two DNA sequences are given, where the shorter sequence is very similar to some contiguous region of the longer sequence. Sim3 determines such a similar region of the longer sequence, and then computes an optimal set of single-nucleotide changes (i.e. insertions, deletions or substitutions) that will convert the shorter sequence to that region. Thus, the alignment scoring scheme is designed to model sequencing errors, rather than evolutionary processes. The program can align a 100 kb sequence to a 1 megabase sequence in a few seconds on a workstation, provided that there are very few differences between the shorter sequence and some region in the longer sequence. The program has been used to assemble sequence data for the Genomes Division at the National Center for Biotechnology Information.
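The computational problem stated here (find the region of the longer sequence that the shorter one converts to with the fewest single-nucleotide changes) can be sketched with a textbook dynamic programming recurrence. This is not the sim3 algorithm: it is a plain quadratic-time approximate matching scheme with unit costs, far slower than the performance reported above, and the function name `best_region` is hypothetical.

```python
def best_region(shorter, longer):
    """Return (edits, start, end) such that longer[start:end] is a region with
    the minimal number of insertions, deletions and substitutions relative
    to `shorter`. Unit costs are an assumption of this sketch."""
    m = len(shorter)
    # prev[i] = (cost, start) for converting shorter[:i] into a region of
    # `longer` that ends at the current column and begins at `start`.
    prev = [(i, 0) for i in range(m + 1)]
    best = (m, 0, 0)  # fallback: empty region, every base of `shorter` unmatched
    for j, base in enumerate(longer, start=1):
        cur = [(0, j)]  # a region may begin anywhere in `longer` at no cost
        for i in range(1, m + 1):
            sub_cost, sub_start = prev[i - 1]
            cur.append(min(
                (sub_cost + (shorter[i - 1] != base), sub_start),  # match or substitution
                (prev[i][0] + 1, prev[i][1]),                      # extra base in `longer`
                (cur[i - 1][0] + 1, cur[i - 1][1]),                # extra base in `shorter`
            ))
        if cur[m][0] < best[0]:
            best = (cur[m][0], cur[m][1], j)
        prev = cur
    return best
```

Calling `best_region("ACGT", "TTTACGTTTT")` locates the exact copy at positions 3 to 7 with zero edits. Only the cost and the region boundaries are tracked; recovering the individual changes would require a traceback, omitted here for brevity.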
Motivation: When evaluating the results of a sequence similarity search, there are many situations where it can be useful to determine whether sequences appearing in the results share some distinguishing characteristic. Such dependencies between database entries are often not readily identifiable, but can yield important new insights into the biological function of a gene or protein.
Results: We have developed a program called CBLAST that sorts the results of a BLAST sequence similarity search according to sequence membership in user-defined 'clusters' of sequences. To demonstrate the utility of this application, we have constructed two cluster databases. The first describes clusters of nucleotide sequences representing the same gene, as documented in the UNIGENE database, and the second describes clusters of protein sequences which are members of the protein families documented in the PROSITE database. Cluster databases and the CBLAST post-processor provide an efficient mechanism for identifying and exploring relationships and dependencies between new sequences and database entries.
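The post-processing idea, sorting search hits by membership in user-defined clusters, can be sketched as follows. This is an illustration only, not the CBLAST implementation: the input shapes (a list of `(sequence_id, score)` pairs and a dictionary mapping sequence identifiers to cluster names), the `"unclustered"` label, and the function name `group_hits_by_cluster` are all assumptions.

```python
from collections import defaultdict

def group_hits_by_cluster(hits, cluster_of):
    """Group (sequence_id, score) hits by cluster and order the clusters by
    their best-scoring member. Hits without a cluster entry are collected
    under the label "unclustered"."""
    groups = defaultdict(list)
    for seq_id, score in hits:
        groups[cluster_of.get(seq_id, "unclustered")].append((seq_id, score))
    for members in groups.values():
        members.sort(key=lambda hit: hit[1], reverse=True)  # best hit first
    # Each group is non-empty, so item[1][0] is its best-scoring hit.
    return sorted(groups.items(), key=lambda item: item[1][0][1], reverse=True)
```

With hits `[("s1", 50), ("s2", 90), ("s3", 70), ("s4", 10)]` and a cluster map that places `s1` and `s3` in cluster `A` and `s2` in cluster `B`, the result lists cluster `B` first, then `A` with `s3` ahead of `s1`, then the unclustered hit.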
Comparison of the secondary structure of the 5' non-coding region of poliovirus 3 RNA derived from the genetic algorithm with the model of Skinner et al. (J. Mol. Biol., 207, 379-392, 1989) demonstrates many of the confirmed structural elements. The genetic algorithm (Shapiro and Navetta, J. Supercomput., 8, 195-201, 1994) generates a population of all possible stems, then mixes, combines, and recombines these stems in multiple iterations on a massively parallel computer, ultimately selecting a most fit structure based on its energy. The secondary structure of the region containing the determinants of neurovirulence was better predicted using the genetic algorithm, whereas the dynamic programming algorithm (Zuker, Science, 244, 48-52, 1989) required phylogenetic comparative sequence analysis to arrive at the correct conclusion. In addition, artificial mutations were introduced throughout this region of the genome and although rearrangements in structure may occur, many structures persisted, suggesting that the given structures thus selected may have evolved to withstand isolated mutations. The genetic algorithm-derived structure for the 5' non-coding region compares favorably with the biological data and functions previously described, and contains all of the 'persistent' structures, suggesting also that the persistence factor may be an aid to validating structures.
Motivation: Understanding which forces maintain transposable elements (TEs) in genomes and populations is one of the main questions in the study of the dynamics of these elements, but the exact nature of these forces is still a matter of speculation. To test theoretical models of TE population dynamics, we need extensive data on the genomic distributions of various elements. These data are now accumulating for the species Drosophila melanogaster, but they are scattered throughout the literature.
Results: The knowledge base DROSOPOSON thus brings together: (1) available data on the chromosomal localizations of TE insertions in Drosophila and on features of the polytene chromosomes (DNA content, recombination rate, break-points, etc.); (2) statistical methods aimed at analysing the distribution of the TE insertions along the chromosomes. In this paper, we present the structure of the base, the data and the statistical methods. Theoretical models of containment of TE copy number in Drosophila can thus be tested.
A general algorithm is presented for identifying sets of positions in multiple sequence alignments that best characterize an a priori partitioning such as those determined by inhibition studies or other experimental techniques. The algorithm explores combinations of polymorphic columns in the alignment and evaluates how well these sites reflect the original input partition. Partitions across the polymorphic columns are derived using a tree building procedure with conventional amino acid substitution matrices. Elucidation of those amino acids which govern the biochemical behaviour of a protein with a given substrate or inhibitor can provide insights towards an understanding of the tertiary conformation of the protein. Since it is likely that such positions will be spatially clustered in the protein fold, these positions may give rise to useful distance constraints for substantiating model protein structures. The method is exemplified using data for a set of human mu class glutathione S-transferases. A novel aspect for predicting the behaviour of new polymorphic sequences is also discussed.
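The search over combinations of polymorphic columns can be illustrated with a simplified sketch. It swaps in a much cruder criterion than the one described above: instead of building trees with amino acid substitution matrices, it groups sequences by exact residue identity at the chosen columns and scores agreement with the a priori partition using the Rand index. The function names and the `max_size` limit are hypothetical, and the sketch assumes at least two sequences and at least one polymorphic column.

```python
from itertools import combinations

def polymorphic_columns(alignment):
    """Indices of alignment columns that contain more than one residue."""
    return [c for c in range(len(alignment[0]))
            if len({row[c] for row in alignment}) > 1]

def rand_index(labels_a, labels_b):
    """Fraction of sequence pairs that both labelings treat the same way
    (together in both, or apart in both)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

def best_column_sets(alignment, partition, max_size=2):
    """Score every combination of up to max_size polymorphic columns and
    return the best score with all column sets that reach it."""
    scored = []
    cols = polymorphic_columns(alignment)
    for size in range(1, max_size + 1):
        for combo in combinations(cols, size):
            # Sequences sharing the same residues at `combo` form one group.
            induced = [tuple(row[c] for c in combo) for row in alignment]
            scored.append((rand_index(induced, partition), combo))
    best = max(score for score, _ in scored)
    return best, [combo for score, combo in scored if score == best]
```

For the toy alignment `["MKAL", "MKAV", "MRGL", "MRGV"]` with partition labels `["x", "x", "y", "y"]`, columns 1 and 2 (alone or together) reproduce the partition exactly, while column 3 does not.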