
Proceedings of the ... Asia-Pacific bioinformatics conference: Latest Articles

AlignScope: A Visual Mining Tool for Gene Team Finding with Whole Genome Alignment
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0010
Hee-Jeong Jin, Hyeno Kim, Jeong-Hyeon Choi, Hwan-Gue Cho
One of the main issues in comparative genomics is the study of chromosomal gene order in one or more related species. Identifying sets of orthologous genes in several genomes has recently become increasingly important, since a cluster of similar genes helps us predict the function of unknown genes. For this purpose, whole genome alignment is usually used to determine horizontal gene transfer, gene duplication, and gene loss between two related genomes. It is also well known that a novel visualization tool for whole genome alignment would be very useful for understanding genome organization and the evolutionary process. In this paper, we propose a method for identifying and visualizing whole genome alignments, especially for detecting gene clusters between two aligned genomes. Since the current rigorous algorithms for finding gene clusters impose strong and artificial constraints, they are not useful for coping with “noisy” alignments. We developed the AlignScope system to provide a simplified structure for genome alignment at any level, and also to help find gene clusters. We have tested AlignScope on several microbial genomes. Alignment is a procedure that compares two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences. This procedure assists in designating the functions of unknown proteins, determining the relatedness of organisms, identifying structurally and functionally important elements, and serving other useful functions [9,12]. Many widely divergent organisms are descended from a common ancestor through a process called evolution. The inheritance patterns and diversity of these organisms carry significant information regarding the nature of small- and large-scale evolutionary events. The complexity and size of the genome make it difficult to analyze.
Because a large amount of biological noise is present when visualizing genomes, it is not enough to simply draw the aligned pairs of various genomes. An alignment visualization tool therefore needs to provide a method for viewing the global structure of a whole genome alignment in a simplified form at any level of detail. Figure 1 clearly illustrates this problem. In Figure 1, the resolution of the snapshot is 800 by 600 pixels, so one pixel corresponds to about 6000 bases of a given genome sequence. Currently there are several systems for visualizing the alignment of genomes. The NCBI Map Viewer [14] provides graphical displays of biological features on NCBI’s as
Pages: 69-78
Citations: 2
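The pixel-scale remark in the AlignScope abstract can be checked with a little arithmetic. The sketch below is illustrative only and is not AlignScope code: it assumes the 800-pixel axis spans the whole genome, so the 4.8 Mb genome length is back-computed from the quoted figures rather than stated in the abstract, and the function names are hypothetical.

```python
def bases_per_pixel(genome_length, pixels):
    """Average number of bases drawn into a single pixel column."""
    return genome_length / pixels


def to_pixel(position, genome_length, pixels):
    """Map a 0-based genome position to a 0-based pixel column."""
    return min(pixels - 1, position * pixels // genome_length)


# Assumed values: an 800-pixel-wide snapshot of a 4.8 Mb genome.
GENOME_LENGTH = 4_800_000
WIDTH = 800

scale = bases_per_pixel(GENOME_LENGTH, WIDTH)
first_column_end = to_pixel(5_999, GENOME_LENGTH, WIDTH)
second_column_start = to_pixel(6_000, GENOME_LENGTH, WIDTH)
```

Under these assumptions about 6000 consecutive bases share one pixel column, which is why drawing every aligned pair directly produces clutter and a simplified view is needed.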
Inference of Gene Regulatory Networks from Microarray Data: A Fuzzy Logic Approach
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0005
P.C.H. Ma, Keith C. C. Chan
Recent developments in large-scale monitoring of gene expression, such as DNA microarrays, have made the reconstruction of gene regulatory networks (GRNs) feasible. Before one can infer the structures of these networks, it is important to identify, for each gene in the network, which genes can affect its expression and how they affect it. Most of the existing approaches are useful exploratory tools in the sense that they allow the user to generate biological hypotheses about transcriptional regulation of genes that can then be tested in the laboratory. However, the patterns discovered by these approaches are not adequate for making accurate predictions of gene expression patterns in new or held-out experiments. Therefore, it is difficult to compare the performance of different approaches or to decide which approach is likely to generate plausible hypotheses. For this reason, we need an approach that not only provides interpretable insight into the structures of GRNs but also provides accurate prediction. In this paper, we present a novel fuzzy logic-based approach to this problem. The desired characteristics of the proposed algorithm are as follows: (i) it is able to directly mine the high-dimensional expression data without the need for additional feature selection procedures, (ii) it is able to distinguish between relevant and irrelevant expression data in predicting the expression patterns of predicted genes, (iii) based on the proposed objective interestingness measure, no user-specified thresholds are needed in advance, (iv) it can make the discovered hidden patterns explicit for possible biological interpretation, (v) the discovered patterns can be used to predict gene expression patterns in other unseen tissue samples, and (vi) with fuzzy logic, it is robust to noise in the expression data as it hides the boundaries of the adjacent intervals of the quantitative attributes.
Experimental results on real expression data show that it can be very effective and that the discovered patterns reveal biologically meaningful regulatory relationships of genes that could help the user reconstruct the underlying structures of GRNs.
Pages: 17-26
Citations: 1
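Point (vi) of the fuzzy logic abstract, hiding the boundaries between adjacent intervals of a quantitative attribute, can be illustrated with overlapping membership functions. This is a minimal sketch under stated assumptions and not the authors' algorithm: expression levels are assumed to be normalized to [0, 1], and the three labels and their piecewise-linear shapes are hypothetical choices.

```python
def fuzzify(level):
    """Return membership degrees of a normalized expression level.

    Assumes `level` is scaled to [0, 1]; values outside are clipped.
    The degrees always sum to 1, and neighbouring labels overlap, so a
    small amount of noise shifts the degrees slightly instead of
    flipping the value from one crisp interval to another.
    """
    x = min(max(level, 0.0), 1.0)
    low = max(0.0, 1.0 - 2.0 * x)    # 1 at x=0, falls to 0 at x=0.5
    high = max(0.0, 2.0 * x - 1.0)   # 0 until x=0.5, rises to 1 at x=1
    medium = 1.0 - low - high        # peaks at x=0.5
    return {"low": low, "medium": medium, "high": high}


quarter = fuzzify(0.25)
middle = fuzzify(0.5)
```

With a crisp cut at 0.5, readings of 0.49 and 0.51 would land in different intervals; here both are almost entirely "medium", which is the sense in which the interval boundary is hidden.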
Microarray Missing Value Imputation by Iterated Local Least Squares
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0019
Zhipeng Cai, M. Heydari, Guohui Lin
Microarray gene expression data often contain missing values resulting from various causes. However, most gene expression data analysis algorithms, such as clustering, classification and network design, require complete information, that is, data without any missing values. It is therefore very important to accurately impute the missing values before applying the data analysis algorithms. In this paper, an Iterated Local Least Squares Imputation method (ILLsimpute) is proposed to estimate the missing values. In ILLsimpute, a similarity threshold is learned using known expression values, and at every iteration it is used to obtain a set of coherent genes for every target gene containing missing values. The target gene is then represented as a linear combination of the coherent genes, using least squares. The algorithm terminates after a certain number of iterations or when the imputation converges. The experimental results on real microarray datasets show that ILLsimpute outperforms the three most recent methods on several commonly tested datasets.
Pages: 159-168
Citations: 7
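The core step described in the ILLsimpute abstract, writing the target gene as a least-squares linear combination of coherent genes, can be sketched as follows. This is a single-pass illustration and not the published method: the iteration and the learning of the similarity threshold are omitted, the threshold is passed in as a fixed Euclidean distance, only one missing value per target gene is handled, and all function names are hypothetical.

```python
import math


def solve_linear(matrix, vector):
    """Solve matrix * x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("coherent genes are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def impute_value(target, missing_col, complete_genes, threshold):
    """Estimate target[missing_col] from genes close to the target.

    Coherent genes are the complete genes whose Euclidean distance to
    the target, over the target's known columns, is <= threshold.
    """
    known = [c for c in range(len(target)) if c != missing_col]
    coherent = [
        g for g in complete_genes
        if math.sqrt(sum((g[c] - target[c]) ** 2 for c in known)) <= threshold
    ]
    if not coherent:
        raise ValueError("no coherent genes within the threshold")
    rows = [[g[c] for c in known] for g in coherent]
    w = [target[c] for c in known]
    k = len(rows)
    # Normal equations (A A^T) x = A w for min_x || A^T x - w ||.
    gram = [[sum(a * b for a, b in zip(rows[i], rows[j])) for j in range(k)]
            for i in range(k)]
    rhs = [sum(a * b for a, b in zip(rows[i], w)) for i in range(k)]
    coeffs = solve_linear(gram, rhs)
    # Apply the same combination to the coherent genes' values in the missing column.
    return sum(x * g[missing_col] for x, g in zip(coeffs, coherent))
```

In this toy setting, a target that equals the first gene plus twice the second on its known columns gets the same combination applied in the missing column; the third gene lies outside the threshold and is ignored. Solving the normal equations directly is chosen for brevity and requires the coherent genes to be linearly independent.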
A Generalized Output-Coding Scheme with SVM for Multiclass Microarray Classification
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0021
Li Shen, E.C. Tan
Multiclass cancer classification based on microarray data is described. A generalized output-coding scheme combined with binary classifiers is used. Different coding strategies, decoding functions and feature selection methods are combined and validated on two cancer datasets: GCM and ALL. The effects of these different methods and their combinations are then discussed. The highest testing accuracies achieved are 78% and 100% for the two datasets, respectively. The results are considered to be very good when compared with other researchers’ work.
Pages: 179-186
Citations: 5
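The idea of output coding with a decoding function, as named in the abstract above, can be sketched in a few lines. The example is generic and not taken from the paper: each class is assigned a binary codeword, one binary classifier predicts each bit, and Hamming-distance decoding picks the nearest codeword. The class labels, the 7-bit code matrix, and the function names are all hypothetical, and the binary classifier outputs are supplied by hand rather than produced by SVMs.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(code_matrix, outputs):
    """Return the class label whose codeword is closest to the classifier outputs."""
    return min(code_matrix, key=lambda label: hamming(code_matrix[label], outputs))


# Hypothetical 4-class code; any two codewords differ in 4 bits,
# so a single wrong binary classifier can be corrected.
CODE_MATRIX = {
    "class_1": (1, 1, 1, 1, 1, 1, 1),
    "class_2": (0, 0, 0, 0, 1, 1, 1),
    "class_3": (0, 0, 1, 1, 0, 0, 1),
    "class_4": (0, 1, 0, 1, 0, 1, 0),
}

# Outputs of the seven binary classifiers for one sample: the codeword
# of class_3 with the sixth bit flipped by a classifier error.
noisy_outputs = (0, 0, 1, 1, 0, 1, 1)
predicted = decode(CODE_MATRIX, noisy_outputs)
```

Hamming decoding is only one possible decoding function; the abstract compares several coding strategies and decoding functions without listing them, so none of the specific choices here should be read as the paper's.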
Techniques for Assessing Phylogenetic Branch Support: A Performance Study
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0022
Derek A. Ruths, L. Nakhleh
The inference of evolutionary relationships is usually aided by a reconstruction method which is expected to produce a reasonably accurate estimation of the true evolutionary history. However, various factors are known to impede the reconstruction process and result in inaccurate estimates of the true evolutionary relationships. Detecting and removing errors (wrong branches) from tree estimates bear great significance on the results of phylogenetic analyses. Methods have been devised for assessing the support of (or confidence in) phylogenetic tree branches, which is one way of quantifying inaccuracies in trees. In this paper, we study, via simulations, the performance of the most commonly used methods for assessing branch support: bootstrap of maximum likelihood and maximum parsimony trees, consensus of maximum parsimony trees, and consensus of Bayesian inference trees. Under the conditions of our experiments, our findings indicate that the actual amount of change along a branch does not have a strong impact on the support of that branch. Further, we find that bootstrap and Bayesian estimates are generally comparable to each other, and superior to a consensus of maximum parsimony trees. In our opinion, the most significant finding of all is that there is no threshold value for any of the methods that would allow for the elimination of wrong branches while maintaining all correct ones: there are always weakly supported true positive branches.
Pages: 187-196
Citations: 5
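Branch support of the kind studied in the abstract above is commonly computed as the fraction of replicate trees that contain a given branch. The sketch below shows that calculation and a threshold filter in a simplified form; it is not the authors' simulation pipeline. Each tree is assumed to be given as a set of branches, each branch as a frozenset holding the taxa on one (consistently chosen) side of the split, and the taxon names and function names are hypothetical.

```python
from collections import Counter


def branch_support(replicate_trees):
    """Fraction of replicate trees that contain each branch (bipartition)."""
    counts = Counter()
    for tree in replicate_trees:
        counts.update(set(tree))
    total = len(replicate_trees)
    return {branch: counts[branch] / total for branch in counts}


def filter_branches(estimated_tree, support, threshold):
    """Keep only the branches of the estimate whose support meets the threshold."""
    return {b for b in estimated_tree if support.get(b, 0.0) >= threshold}


AB = frozenset({"A", "B"})
AC = frozenset({"A", "C"})
CD = frozenset({"C", "D"})
DE = frozenset({"D", "E"})

# Four hypothetical bootstrap replicates over taxa A..E.
replicates = [{AB, DE}, {AB, DE}, {AB, CD}, {AC, DE}]
support = branch_support(replicates)
kept = filter_branches({AB, CD}, support, threshold=0.5)
```

If the branch CD happened to be correct, the 0.5 cutoff would discard a true branch, which is the kind of trade-off behind the abstract's observation that no single threshold removes all wrong branches while keeping all correct ones.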
A Novel Approach for Structured Consensus Motif Inference Under Specificity and Quorum Constraints
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0024
Christine Sinoquet
We address the issue of structured motif inference. This problem is stated as follows: given a set of n DNA sequences and a quorum q (%), find the optimal structured consensus motif, described as gaps alternating with specific regions and shared by at least q × n sequences. Our proposal is in the domain of metaheuristics: it runs solutions to convergence through cooperation between a sampling strategy of the search space and a quick detection of local similarities in small sequence samples. The contributions of this paper are: (1) the design of a stochastic method whose genuine novelty rests on driving the search with a threshold frequency f discriminating between specific regions and gaps; (2) the original way of justifying the specially designed operations; (3) the implementation of a mining tool well adapted to biologists' exigencies: few input parameters are required (quorum q, minimal threshold frequency f, maximal gap length g). Our approach proves efficient on simulated data, promoter sites in dicot plants and transcription factor binding sites in the E. coli genome. Our algorithm, Kaos, compares favorably with MEME and STARS in terms of accuracy.
Pages: 207-216
Citations: 0
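The quorum constraint in the problem statement above can be made concrete with a small check. This is not the Kaos algorithm, which searches for the motif; the sketch only tests whether a given candidate satisfies the quorum. It assumes a simplified structured motif made of exact specific regions ("boxes") separated by gaps of at most g arbitrary characters, an integer quorum percentage, and hypothetical function names and sequences.

```python
import re


def structured_pattern(boxes, max_gap):
    """Compile boxes separated by gaps of 0..max_gap arbitrary characters."""
    gap = ".{0,%d}" % max_gap
    return re.compile(gap.join(re.escape(box) for box in boxes))


def satisfies_quorum(boxes, max_gap, sequences, quorum_percent):
    """True if the motif occurs in at least quorum_percent % of the sequences."""
    pattern = structured_pattern(boxes, max_gap)
    hits = sum(1 for seq in sequences if pattern.search(seq))
    # Ceiling of quorum_percent * n / 100 in integer arithmetic.
    needed = (quorum_percent * len(sequences) + 99) // 100
    return hits >= needed


# Illustrative sequences: two contain both boxes within the allowed gap.
sequences = [
    "GGTTGACAGCTATAATCC",   # gap of 2
    "TTGACATATAAT",         # gap of 0
    "TTGACAGGGGGGTATAAT",   # gap of 6, too long for max_gap=3
    "CCCCCC",               # no occurrence
]
boxes = ["TTGACA", "TATAAT"]
```

With a maximal gap of 3, the candidate occurs in 2 of the 4 sequences, so it meets a 50% quorum but not a 75% one. A consensus motif in the paper's sense tolerates variation inside the specific regions, which exact string boxes do not capture.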
An Efficient Algorithm for String Motif Discovery
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0011
Francis Y. L. Chin, Henry C. M. Leung
Finding common patterns (motifs) in a set of DNA sequences is an important problem in bioinformatics. One common representation of motifs is a string with symbols A, C, G, T and N, where N stands for the wildcard symbol. In this paper, we introduce a more general motif discovery problem that has none of the weaknesses of the Planted (l,d)-Motif Problem and that takes a set of control sequences as an additional input. The existing algorithms using a brute force approach for solving similar problems take O(n(t+f)l5) time, where t and f are the number of input sequences and control sequences respectively, n is the length of each sequence and l is the length of the motif. We propose an efficient algorithm, called VAS, which has an expected running time O(nfl(nt)(4+1/4)) using O((nt)(4+1/4)) space for any integer k. In particular when k = 3, the time and space complexities are O(nlf (nt)(1.0625)) and O((nt)(1.0625)) respectively. This algorithm makes use of voting and graph representation for better time and space complexities. This technique can also be used to improve the performance of some existing algorithms.
Pages: 79-88
Citations: 18
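The string representation with the wildcard N, and the role of control sequences, can be illustrated with a brute-force matcher. This is not the VAS algorithm and does not use voting or a graph representation; it only counts how many input and control sequences contain a given candidate motif. How the paper scores a motif against the control set is not stated in the abstract, so the comparison at the end is an assumption, as are the function names and the toy sequences.

```python
def matches_at(motif, sequence, start):
    """True if motif matches sequence at start; 'N' matches any symbol."""
    window = sequence[start:start + len(motif)]
    return len(window) == len(motif) and all(
        m == "N" or m == s for m, s in zip(motif, window)
    )


def occurs_in(motif, sequence):
    """True if the motif matches at any position of the sequence."""
    return any(
        matches_at(motif, sequence, i)
        for i in range(len(sequence) - len(motif) + 1)
    )


def coverage(motif, sequences):
    """Number of sequences that contain at least one occurrence of the motif."""
    return sum(1 for seq in sequences if occurs_in(motif, seq))


inputs = ["GGACGTAA", "ACTTCC", "TTTACATG"]
controls = ["GGGGGG", "ACGG"]
candidate = "ACNT"

input_hits = coverage(candidate, inputs)
control_hits = coverage(candidate, controls)
```

Here the candidate appears in every input sequence and in no control sequence, which is the kind of contrast a control set is meant to expose. Scanning every candidate this way is the brute-force cost the abstract contrasts VAS against.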
On the Inference of Regulatory Elements, Circuits, and Modules
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0001
Wen-Hsiung Li
Advances in genomics have led to the production of various functional genomic data as well as genomic sequence data. This is particularly true in yeasts. Such data have proved to be highly useful for inferring regulatory elements and modules. I shall present studies that I have done with my colleagues and collaborators on the following topics: (1) detection of transcription factors (including their interactions) involved in a specific function such as the cell cycle, (2) inference of the cis elements (binding sites and sequences) of a transcription factor, (3) reconstruction of the regulatory circuits of genes, and (4) inference of regulatory modules. In all these topics, we have developed methods and have applied them to analyze data from yeasts.
Pages: 1
Citations: 0
Structure-Based Chemical Shift Prediction Using Random Forests Non-Linear Regression
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0035
K. Arun, C. Langmead
Protein nuclear magnetic resonance (NMR) chemical shifts are among the most accurately measurable spectroscopic parameters and are closely correlated to protein structure because of their dependence on the local electronic environment. The precise nature of this correlation remains largely unknown. Accurate prediction of chemical shifts from existing structures’ atomic co-ordinates will permit close study of this relationship. This paper presents a novel non-linear regression-based approach to chemical shift prediction from protein structure. The regression model combines quantum, classical, and empirical variables and provides a statistically significant improvement in prediction accuracy over existing chemical shift predictors across protein backbone atom types. The results presented here were obtained using the Random Forest regression algorithm on a protein entry data set derived from the RefDB re-referenced chemical shift database.
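As a rough illustration of the kind of regression the abstract names, the sketch below builds a heavily simplified random-forest-style regressor: each tree is a single-split regression stump fitted on a bootstrap sample using one randomly chosen feature, and predictions are averaged over trees. It is not the authors' model. The feature columns, the target values, and the function names `fit_stump`, `fit_forest`, and `predict` are all hypothetical, and the data are synthetic toy numbers, not chemical shifts from RefDB.

```python
import random
from statistics import mean


def fit_stump(X, y, feature):
    """Fit a one-split regression tree on a single feature (least squared error)."""
    best = None
    values = sorted(set(row[feature] for row in X))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [yi for row, yi in zip(X, y) if row[feature] <= t]
        right = [yi for row, yi in zip(X, y) if row[feature] > t]
        ml, mr = mean(left), mean(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:
        # The feature is constant in this sample: predict the sample mean.
        m = mean(y)
        return (feature, 0.0, m, m)
    return (feature, best[1], best[2], best[3])


def fit_forest(X, y, n_trees=50, seed=0):
    """Bagging plus random feature choice: the two ingredients of a random forest."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample with replacement
        feature = rng.randrange(d)                  # one random feature per tree
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Average the predictions of all stumps."""
    return mean(ml if row[f] <= t else mr for f, t, ml, mr in forest)


# Synthetic toy data (assumption): column 0 is informative, column 1 is constant.
X = [[float(i), 0.0] for i in range(10)]
y = [1.0 if i < 5 else 3.0 for i in range(10)]

forest = fit_forest(X, y)
low = predict(forest, [0.0, 0.0])
high = predict(forest, [9.0, 0.0])
```

A real application would use full-depth trees, many structural descriptors per atom, and held-out evaluation; the stump version only shows how bootstrap sampling and random feature selection combine into an averaged prediction.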
Citations: 24
Resolving the Gene Tree and Species Tree Problem by Phylogenetic Mining
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0032
Xiaoxu Han
The gene tree and species tree problem remains a central problem in phylogenomics. To overcome this problem, gene concatenation approaches have been used: a certain number of genes are chosen at random from a set of widely distributed orthologous genes selected from genome data, and the combined genes are used for phylogenetic analysis. The random concatenation mechanism, however, prevents further investigation of the inner structure of the gene data set used to infer the phylogenetic trees, and it prevents us from locating the most phylogenetically informative genes. In this work, a phylogenomic mining approach is described that gains knowledge from a gene data set by clustering its genes with a self-organizing map (SOM) to explore the inner structure of the data set. The most phylogenetically informative gene set is then created by picking the maximum-entropy gene from each cluster, and phylogenetic trees are inferred from this set by phylogenetic analysis. On the same data set, the phylogenetic mining approach performs better than the random gene concatenation approach.
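The selection step described above (one maximum-entropy gene per cluster) can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not say how entropy is computed, so the sketch uses the Shannon entropy of each sequence's character frequencies, and it takes the cluster assignment as given rather than running a SOM. The gene names, toy sequences, and the functions `shannon_entropy` and `pick_max_entropy_genes` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(seq):
    """Shannon entropy in bits of the character distribution of a sequence."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def pick_max_entropy_genes(sequences, cluster_of):
    """Return {cluster_id: gene_name}, keeping the highest-entropy gene per cluster."""
    by_cluster = defaultdict(list)
    for gene in sequences:
        by_cluster[cluster_of[gene]].append(gene)
    return {
        cluster: max(genes, key=lambda g: shannon_entropy(sequences[g]))
        for cluster, genes in by_cluster.items()
    }


# Toy sequences and a made-up cluster assignment (assumption: in the paper
# the clusters would come from the self-organizing map).
sequences = {
    "geneA": "AAAAAAAA",
    "geneB": "ACGTACGT",
    "geneC": "AACCAACC",
    "geneD": "AAAAAAAC",
}
cluster_of = {"geneA": 0, "geneB": 0, "geneC": 1, "geneD": 1}

selected = pick_max_entropy_genes(sequences, cluster_of)
```

Here `selected` keeps `geneB` for cluster 0 and `geneC` for cluster 1, since those sequences have the most even character distributions in their clusters. The selected genes would then be concatenated for tree inference in place of a random draw.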
Citations: 1