Proceedings. IEEE Computational Systems Bioinformatics Conference: latest articles (page 5)

Shannon information in complete genomes.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-08-16 DOI: 10.1109/CSB.2004.153

Chang-Heng Chang, L. Hsieh, T. Chen, Hong-Da Chen, L. Luo, Hoong-Chien Lee

Shannon information in the genomes of all completely sequenced prokaryotes and eukaryotes is measured in word lengths of two to ten letters. It is found that, in a scale-dependent way, the Shannon information in complete genomes is much greater than that in matching random sequences, thousands of times greater in the case of short words. Furthermore, with the exception of the 14 chromosomes of Plasmodium falciparum, the Shannon information in all available complete genomes belongs to a universality class given by an extremely simple formula. The data are consistent with a model for genome growth composed of two main ingredients: random segmental duplications that increase the Shannon information in a scale-independent way, and random point mutations that preferentially reduce the larger-scale Shannon information. The inference drawn from the present study is that the large-scale and coarse-grained growth of genomes was selectively neutral, which suggests an independent corroboration of Kimura's neutral theory of evolution.
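The abstract does not give the exact definition of Shannon information it uses, so the following is only a minimal sketch under one common assumption: the information of a sequence at word length k is taken as the entropy deficit, that is, the maximum possible entropy of k-letter words minus the entropy of the observed word frequencies. The function names (`kmer_counts`, `shannon_entropy`, `shannon_information`) are illustrative and do not come from the paper.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shannon_entropy(counts):
    """Shannon entropy, in bits, of the empirical word distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def shannon_information(seq, k, alphabet_size=4):
    """Entropy deficit: maximum word entropy minus observed word entropy.

    Assumed definition for illustration; the paper may normalize differently.
    """
    h_max = k * math.log2(alphabet_size)
    return h_max - shannon_entropy(kmer_counts(seq, k))


if __name__ == "__main__":
    for k in range(2, 5):
        print(k, shannon_information("ACGTTGCAACGTAAAACCCGGGTT", k))
```

Under this definition a sequence whose words are all equally frequent carries zero information, and a sequence made of a single repeated word carries the maximum, which matches the abstract's comparison against matching random sequences.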

Citations: 2

Inverse Protein Folding in 2D HP Mode (Extended Abstract)

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-08-16 DOI: 10.1109/CSB.2004.1332444

Arvind Gupta, Ján Manuch, L. Stacho

The inverse protein folding problem is that of designing an amino acid sequence which has a particular native protein fold. This problem arises in drug design, where a particular structure is necessary to ensure proper protein-protein interactions. In this paper we show that in the 2D HP model of Dill it is possible to solve this problem for a broad class of structures. These structures can be used to closely approximate any given structure. One of the most important properties of a good protein is its stability, that is, the aptitude not to fold simultaneously into other structures. We show that for a number of basic structures, our sequences have a unique fold.
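For readers unfamiliar with the 2D HP model, the sketch below scores a given fold in the standard way: residues are H (hydrophobic) or P (polar), the chain is a self-avoiding walk on the square lattice, and the energy is minus the number of H-H pairs that are lattice neighbors but not consecutive in the chain. It only evaluates a fold; it is not the design construction of the paper, and the function name `hh_contacts` is illustrative.

```python
def hh_contacts(sequence, coords):
    """Count H-H contacts of an HP sequence placed on the 2D square lattice.

    sequence: string over {"H", "P"}
    coords:   list of (x, y) lattice points, one per residue, in chain order
    """
    if len(sequence) != len(coords):
        raise ValueError("one coordinate per residue is required")
    if len(set(coords)) != len(coords):
        raise ValueError("fold is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbors")

    position = {xy: i for i, xy in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each neighboring pair is counted once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = position.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts


if __name__ == "__main__":
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    print("energy:", -hh_contacts("HPPH", square))
```

A sequence has a unique fold in the sense of the abstract when exactly one self-avoiding walk (up to symmetry) attains the maximum contact count.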

Citations: 6

Shannon information in complete genomes.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01 DOI: 10.1109/csb.2004.1332413

Chang-Heng Chang, Li-Ching Hsieh, Ta-Yuan Chen, Hong-Da Chen, Liaofu Luo, Hoong-Chien Lee

Shannon information in the genomes of all completely sequenced prokaryotes and eukaryotes is measured in word lengths of two to ten letters. It is found that, in a scale-dependent way, the Shannon information in complete genomes is much greater than that in matching random sequences, thousands of times greater in the case of short words. Furthermore, with the exception of the 14 chromosomes of Plasmodium falciparum, the Shannon information in all available complete genomes belongs to a universality class given by an extremely simple formula. The data are consistent with a model for genome growth composed of two main ingredients: random segmental duplications that increase the Shannon information in a scale-independent way, and random point mutations that preferentially reduce the larger-scale Shannon information. The inference drawn from the present study is that the large-scale and coarse-grained growth of genomes was selectively neutral, which suggests an independent corroboration of Kimura's neutral theory of evolution.

Pages: 20-30

Citations: 0

Algorithms for association study design using a generalized model of haplotype conservation.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01

Russell Schwartz

There is considerable interest in computational methods to assist in the use of genetic polymorphism data for locating disease-related genes. Haplotypes, contiguous sets of correlated variants, may provide a means of reducing the difficulty of the data analysis problems involved. The field to date has been dominated by methods based on the "haplotype block" hypothesis, which assumes discrete population-wide boundaries between conserved genetic segments, but there is strong reason to believe that haplotype blocks do not fully capture true haplotype conservation patterns. In this paper, we address the computational challenges of using a more flexible, block-free representation of haplotype structure called the "haplotype motif" model for downstream analysis problems. We develop algorithms for htSNP selection and missing data inference using this more generalized model of sequence conservation. Application to a dataset from the literature demonstrates the practical value of these block-free methods.

Citations: 0

SPIDER: software for protein identification from sequence tags with de novo sequencing error.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01 DOI: 10.1109/csb.2004.1332434

Yonghua Han, Bin Ma, Kaizhong Zhang

For the identification of novel proteins using MS/MS, de novo sequencing software computes one or several possible amino acid sequences (called sequence tags) for each MS/MS spectrum. Those tags are then matched, accounting for amino acid mutations, against the sequences in a protein database. If the de novo sequencing gives correct tags, the homologs of the proteins can be identified by this approach, and software such as MS-BLAST is available for the matching. However, de novo sequencing very often gives only partially correct tags. The most common error is that a segment of amino acids is replaced by another segment with approximately the same mass. We developed a new efficient algorithm to match sequence tags with errors to database sequences for the purpose of protein and peptide identification. A software package, SPIDER, was developed and made available on the Internet for free public use. This paper describes the algorithms and features of the SPIDER software.
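The kind of de novo error described above can be illustrated with a small mass check. This is a sketch only, not SPIDER's matching algorithm: it uses standard monoisotopic residue masses, and the helper names (`segment_mass`, `same_mass`) and the tolerance of 0.05 Da are assumptions chosen for the example.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def segment_mass(segment):
    """Sum of residue masses for a peptide segment."""
    return sum(RESIDUE_MASS[aa] for aa in segment)


def same_mass(segment_a, segment_b, tolerance=0.05):
    """True if two segments are indistinguishable by mass within tolerance.

    The tolerance is a hypothetical instrument accuracy, not a SPIDER setting.
    """
    return abs(segment_mass(segment_a) - segment_mass(segment_b)) <= tolerance


if __name__ == "__main__":
    for a, b in [("GG", "N"), ("GA", "Q"), ("L", "I"), ("A", "G")]:
        print(a, b, same_mass(a, b))
```

Pairs such as GG/N and GA/Q differ by about 0.00001 Da, which is why a tag can be wrong at the sequence level while still explaining the spectrum, and why tag-to-database matching has to tolerate such substitutions in addition to true mutations.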

Citations: 30

MinPD: distance-based phylogenetic analysis and recombination detection of serially-sampled HIV quasispecies.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01

Patricia Buendia, Giri Narasimhan

A new computational method to study within-host viral evolution is explored to better understand the evolution and pathogenesis of viruses. Traditional phylogenetic tree methods are better suited to study relationships between contemporaneous species, which appear as leaves of a phylogenetic tree. However, viral sequences are often sampled serially from a single host. Consequently, data may be available at the leaves as well as the internal nodes of a phylogenetic tree. Recombination may further complicate the analysis. Such relationships are not easily expressed by traditional phylogenetic methods. We propose a new algorithm, called MinPD, based on minimum pairwise distances. Our algorithm uses multiple distance matrices and correlation rules to output a MinPD tree or network. We test our algorithm using extensive simulations and apply it to a set of HIV sequence data isolated from one patient over a period of ten years. The proposed visualization of the phylogenetic tree or network further enhances the benefits of our methods.
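The core idea of linking serially sampled sequences by minimum pairwise distance can be sketched as follows. This is a deliberately simplified illustration and not the MinPD algorithm itself: it uses a single Hamming distance instead of multiple distance matrices, has no correlation rules, and does not detect recombination. The function names and the `(name, time, sequence)` input layout are assumptions.

```python
def hamming(a, b):
    """Number of mismatching positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_ancestors(samples):
    """Link each sequence to its nearest sequence from an earlier time point.

    samples: list of (name, sampling_time, aligned_sequence)
    Returns {name: ancestor_name}, with None for the earliest samples.
    Ties are broken by ancestor name, an arbitrary choice for this sketch.
    """
    parents = {}
    for name, time, seq in samples:
        earlier = [(hamming(seq, s), n) for n, t, s in samples if t < time]
        parents[name] = min(earlier)[1] if earlier else None
    return parents


if __name__ == "__main__":
    samples = [
        ("a0", 0, "AAAA"), ("b0", 0, "CCCC"),
        ("a1", 1, "AAAT"), ("b1", 1, "CCCG"),
        ("a2", 2, "AATT"),
    ]
    print(closest_ancestors(samples))
```

Because an ancestor may itself be a sampled sequence, the result places observed data at internal nodes, which is the situation the abstract says traditional leaf-only trees handle poorly.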

Citations: 0

Comparison of two schemes for automatic keyword extraction from MEDLINE for functional gene clustering.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01 DOI: 10.1109/csb.2004.1332452

Ying Liu, Brian J Ciliax, Karin Borges, Venu Dasigi, Ashwin Ram, Shamkant B Navathe, Ray Dingledine

One of the key challenges of microarray studies is to derive biological insights from the unprecedented quantities of data on gene-expression patterns. Clustering genes by functional keyword association can provide direct information about the nature of the functional links among genes within the derived clusters. However, the quality of the keyword lists extracted from biomedical literature for each gene significantly affects the clustering results. We extracted keywords from MEDLINE that describe the most prominent functions of the genes, and used the resulting weights of the keywords as feature vectors for gene clustering. By analyzing the resulting cluster quality, we compared two keyword weighting schemes: normalized z-score and term frequency-inverse document frequency (TFIDF). The best combination of background comparison set, stop list and stemming algorithm was selected based on precision and recall metrics. In a test set of four known gene groups, a hierarchical algorithm correctly assigned 25 of 26 genes to the appropriate clusters based on keywords extracted by the TFIDF weighting scheme, but only 23 of 26 with the z-score method. To evaluate the effectiveness of the weighting schemes for keyword extraction for gene clusters from microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle were used as a second test set. Using established measures of cluster quality, the results produced from TFIDF-weighted keywords had higher purity, lower entropy, and higher mutual information than those produced from normalized z-score weighted keywords. The optimized algorithms should be useful for sorting genes from microarray lists into functionally discrete clusters.
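As a reminder of what TFIDF weighting computes, here is a minimal sketch using the textbook form (relative term frequency times the natural log of inverse document frequency). The abstract does not state which TFIDF variant or normalization the authors used, so this form is an assumption, as are the function name `tfidf` and the toy keyword lists.

```python
import math
from collections import Counter


def tfidf(documents):
    """Compute TF-IDF keyword weights per gene.

    documents: {gene: list of keywords extracted for that gene}
    Returns {gene: {keyword: weight}}, usable as sparse feature vectors.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))  # count each keyword once per gene

    weights = {}
    for gene, words in documents.items():
        term_freq = Counter(words)
        weights[gene] = {
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in term_freq.items()
        }
    return weights


if __name__ == "__main__":
    docs = {
        "g1": ["kinase", "kinase", "binding"],
        "g2": ["binding", "transport"],
    }
    for gene, vector in tfidf(docs).items():
        print(gene, vector)
```

A keyword that appears for every gene gets weight zero, so the resulting vectors emphasize the terms that distinguish one gene's literature from another's, which is the property a clustering step relies on.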

Pages: 394-404

Citations: 0

Calculation, visualization, and manipulation of MASTs (Maximum Agreement Subtrees).

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01 DOI: 10.1109/csb.2004.1332453

Shiming Dong, Eileen Kraemer

Phylogenetic trees are used to represent the evolutionary history of a set of species. Comparison of multiple phylogenetic trees can help researchers find the common classification of a tree group, compare tree construction inferences or obtain distances between trees. We present TreeAnalyzer, a freely available package for phylogenetic tree comparison. A MAST (Maximum Agreement Subtree) algorithm is implemented to compare the trees. Additional features of this software include tree comparison, visualization, manipulation, labeling, and printing.

Availability: http://www.cs.uga.edu/~eileen/TreeAnalyzer.

Citations: 0

Minimum entropy clustering and applications to gene expression analysis.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01 DOI: 10.1109/csb.2004.1332427

Haifeng Li, Keshu Zhang, Tao Jiang

Clustering is a common methodology for analyzing gene expression data. In this paper, we present a new clustering algorithm from an information-theoretic point of view. First, we propose the minimum entropy (measured on a posteriori probabilities) criterion, which is the conditional entropy of clusters given the observations. Fano's inequality indicates that it could be a good criterion for clustering. We generalize the criterion by replacing Shannon's entropy with Havrda-Charvat's structural alpha-entropy. Interestingly, the minimum entropy criterion based on structural alpha-entropy is equal to the probability of error of the nearest neighbor method when alpha = 2. This is further evidence that the proposed criterion is good for clustering. With a non-parametric approach for estimating a posteriori probabilities, an efficient iterative algorithm is then established to minimize the entropy. The experimental results show that the clustering algorithm performs significantly better than k-means/medians, hierarchical clustering, SOM, and EM in terms of adjusted Rand index. In particular, our algorithm performs very well even when the correct number of clusters is unknown. In addition, most clustering algorithms produce poor partitions in the presence of outliers, while our method can correctly reveal the structure of data and effectively identify outliers simultaneously.
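The two entropies named in the abstract can be compared numerically with a short sketch. It assumes the usual Havrda-Charvat form H_alpha(p) = (sum_i p_i^alpha - 1) / (2^(1 - alpha) - 1), which tends to Shannon entropy in bits as alpha approaches 1; the paper's normalization constant may differ. The criterion function simply averages the entropy of per-sample cluster posteriors, and how those posteriors are estimated (the non-parametric step) is not shown. All function names are illustrative.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def structural_alpha_entropy(p, alpha):
    """Havrda-Charvat structural alpha-entropy (assumed normalization)."""
    if alpha == 1:
        return shannon_entropy(p)
    return (sum(x ** alpha for x in p) - 1.0) / (2.0 ** (1.0 - alpha) - 1.0)


def min_entropy_criterion(posteriors, alpha=2.0):
    """Average entropy of the cluster posteriors, one vector per observation.

    Lower values mean observations are assigned to clusters with more certainty.
    """
    return sum(structural_alpha_entropy(p, alpha) for p in posteriors) / len(posteriors)


if __name__ == "__main__":
    posteriors = [[0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]
    print(min_entropy_criterion(posteriors, alpha=2.0))
    print(min_entropy_criterion(posteriors, alpha=1))
```

With alpha = 2 the formula reduces to 2 * (1 - sum_i p_i^2), which is proportional to the quantity 1 - sum_i p_i^2 that appears in the asymptotic nearest-neighbor error, consistent with the connection the abstract draws.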

Pages: 142-51

Citations: 0

Recurrence time statistics: versatile tools for genomic DNA sequence analysis.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01 DOI: 10.1109/csb.2004.1332415

Yinhe Cao, Wen-Wen Tung, J B Gao

With the completion of the human and a few model organisms' genomes, and the genomes of many other organisms waiting to be sequenced, it has become increasingly important to develop faster computational tools which are capable of easily identifying the structures and extracting features from DNA sequences. Among the more important structures in a DNA sequence are repeats. Often they have to be masked before protein coding regions along a DNA sequence are to be identified or redundant expressed sequence tags (ESTs) are to be sequenced. Here we report a novel recurrence time based method for sequence analysis. The method can conveniently study all kinds of periodicity and exhaustively find all repeat-related features from a genomic DNA sequence. An efficient codon index is also derived from the recurrence time statistics, which has the salient features of being largely species-independent and working well on very short sequences. Efficient codon indices are key elements of successful gene finding algorithms, and are particularly useful for determining whether a suspected EST belongs to a coding or non-coding region. We illustrate the power of the method by studying the genomes of E. coli, the yeast S. cerevisiae, the nematode worm C. elegans, and the human, Homo sapiens. Computationally, our method is very efficient. It allows us to carry out analysis of genomes on the whole genomic scale on a PC.
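A recurrence time can be read as the distance between successive occurrences of the same word along the sequence. The sketch below uses that first-return definition for overlapping words of length k; the abstract does not spell out the paper's exact definition or its codon index, so this is an assumption, and the function names are illustrative.

```python
from collections import Counter, defaultdict


def recurrence_times(seq, k):
    """Gaps between successive occurrences of each length-k word in seq."""
    last_seen = {}
    times = defaultdict(list)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in last_seen:
            times[word].append(i - last_seen[word])
        last_seen[word] = i
    return dict(times)


def recurrence_histogram(seq, k):
    """Histogram of recurrence times pooled over all words of length k."""
    hist = Counter()
    for gaps in recurrence_times(seq, k).values():
        hist.update(gaps)
    return hist


if __name__ == "__main__":
    print(recurrence_histogram("ATGATGATGCCATGA", 3))
```

A tandem repeat of period 3 shows up as a peak at recurrence time 3 in the pooled histogram, which is the sense in which such statistics expose periodicity and repeat-related features. The single pass over the sequence with a dictionary lookup keeps the cost linear in sequence length, in line with the abstract's claim that whole-genome analysis fits on a PC.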

Pages: 40-51

Citations: 10