Shannon information in the genomes of all completely sequenced prokaryotes and eukaryotes are measured in word lengths of two to ten letters. It is found that in a scale-dependent way, the Shannon information in complete genomes are much greater than that in matching random sequences - thousands of times greater in the case of short words. Furthermore, with the exception of the 14 chromosomes of Plasmodium falciparum, the Shannon information in all available complete genomes belong to a universality class given by an extremely simple formula. The data are consistent with a model for genome growth composed of two main ingredients: random segmental duplications that increase the Shannon information in a scale-independent way, and random point mutations that preferentially reduces the larger-scale Shannon information. The inference drawn from the present study is that the large-scale and coarse-grained growth of genomes was selectively neutral and this suggests an independent corroboration of Kimura's neutral theory of evolution.
There is considerable interest in computational methods to assist in the use of genetic polymorphism data for locating disease-related genes. Haplotypes, contiguous sets of correlated variants, may provide a means of reducing the difficulty of the data analysis problems involved. The field to date has been dominated by methods based on the "haplotype block" hypothesis, which assumes discrete population-wide boundaries between conserved genetic segments, but there is strong reason to believe that haplotype blocks do not fully capture true haplotype conservation patterns. In this paper, we address the computational challenges of using a more flexible, block-free representation of haplotype structure called the "haplotype motif" model for downstream analysis problems. We develop algorithms for htSNP selection and missing data inference using this more generalized model of sequence conservation. Application to a dataset from the literature demonstrates the practical value of these block-free methods.
For the identification of novel proteins using MS/MS, de novo sequencing software computes one or several possible amino acid sequences (called sequence tags) for each MS/MS spectrum. Those tags are then used to match, accounting amino acid mutations, the sequences in a protein database. If the de novo sequencing gives correct tags, the homologs of the proteins can be identified by this approach and software such as MS-BLAST is available for the matching. However, de novo sequencing very often gives only partially correct tags. The most common error is that a segment of amino acids is replaced by another segment with approximately the same masses. We developed a new efficient algorithm to match sequence tags with errors to database sequences for the purpose of protein and peptide identification. A software package, SPIDER, was developed and made available on Internet for free public use. This paper describes the algorithms and features of the SPIDER software.
A new computational method to study within-host viral evolution is explored to better understand the evolution and pathogenesis of viruses. Traditional phylogenetic tree methods are better suited to study relationships between contemporaneous species, which appear as leaves of a phylogenetic tree. However, viral sequences are often sampled serially from a single host. Consequently, data may be available at the leaves as well as the internal nodes of a phylogenetic tree. Recombination may further complicate the analysis. Such relationships are not easily expressed by traditional phylogenetic methods. We propose a new algorithm, called MinPD, based on minimum pairwise distances. Our algorithm uses multiple distance matrices and correlation rules to output a MinPD tree or network. We test our algorithm using extensive simmulations and apply it to a set of HIV sequence data isolated from one patient over a period of ten years. The proposed visualization of the phylogenetic treenetwork further enhances the benefits of our methods.
One of the key challenges of microarray studies is to derive biological insights from the unprecedented quatities of data on gene-expression patterns. Clustering genes by functional keyword association can provide direct information about the nature of the functional links among genes within the derived clusters. However, the quality of the keyword lists extracted from biomedical literature for each gene significantly affects the clustering results. We extracted keywords from MEDLINE that describes the most prominent functions of the genes, and used the resulting weights of the keywords as feature vectors for gene clustering. By analyzing the resulting cluster quality, we compared two keyword weighting schemes: normalized z-score and term frequency-inverse document frequency (TFIDF). The best combination of background comparison set, stop list and stemming algorithm was selected based on precision and recall metrics. In a test set of four known gene groups, a hierarchical algorithm correctly assigned 25 of 26 genes to the appropriate clusters based on keywords extracted by the TDFIDF weighting scheme, but only 23 og 26 with the z-score method. To evaluate the effectiveness of the weighting schemes for keyword extraction for gene clusters from microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle were used as a second test set. Using established measures of cluster quality, the results produced from TFIDF-weighted keywords had higher purity, lower entropy, and higher mutual information than those produced from normalized z-score weighted keywords. The optimized algorithms should be useful for sorting genes from microarray lists into functionally discrete clusters.
Unlabelled: Phylogenetic trees are used to represent the evolutionary history of a set of species. Comparison of multiple phylogenetic trees can help researchers find the common classification of a tree group, compare tree construction inferences or obtain distances between trees. We present TreeAnalyzer, a freely available package for phylogenetic tree comparison. A MAST (Maximum Agreement Subtree) algorithm is implemented to compare the trees. Additional features of this software include tree comparison, visualization, manipulation, labeling, and printing.
Availability: http://www.cs.uga.edu/~eileen/TreeAnalyzer.
Clustering is a common methodology for analyzing the gene expression data. In this paper, we present a new clustering algorithm from an information-theoretic point of view. First, we propose the minimum entropy (measured on a posteriori probabilities) criterion, which is the conditional entropy of clusters given the observations. Fano's inequality indicates that it could be a good criterion for clustering. We generalize the criterion by replacing Shannon's entropy with Havrda-Charvat's structural alpha-entropy. Interestingly, the minimum entropy criterion based on structural alpha-entropy is equal to the probability error of the nearest neighbor method when alpha = 2. This is another evidence that the proposed criterion is good for clustering. With a non-parametric approach for estimating a posteriori probabilities, an efficient iterative algorithm is then established to minimize the entropy. The experimental results show that the clustering algorithm performs significantly better than k-means/medians, hierarchical clustering, SOM, and EM in terms of adjusted Rand index. Particularly, our algorithm performs very well even when the correct number of clusters is unknown. In addition, most clustering algorithms produce poor partitions in presence of outliers while our method can correctly reveal the structure of data and effectively identify outliers simultaneously.
With the completion of the human and a few model organisms' genomes, and the genomes of many other organisms waiting to be sequenced, it has become increasingly important to develop faster computational tools which are capable of easily identifying the structures and extracting features from DNA sequences. One of the more important structures in a DNA sequence is repeat-related. Often they have to be masked before protein coding regions along a DNA sequence are to be identified or redundant expressed sequence tags (ESTs) are to be sequenced. Here we report a novel recurrence time based method for sequence analysis. The method can conveniently study all kinds of periodicity and exhaustively find all repeat-related features from a genomic DNA sequence. An efficient codon index is also derived from the recurrence time statistics, which has the salient features of being largely species-independent and working well on very short sequences. Efficient codon indices are key elements of successful gene finding algorithms, and are particularly useful for determining whether a suspected EST belongs to a coding or non-coding region. We illustrate the power of the method by studying the genomes of E. coli, the yeast S. cervisivae, the nematode worm C. elegans, and the human, Homo sapiens. Computationally, our method is very efficient. It allows us to carry out analysis of genomes on the whole genomic scale by a PC.