Computational systems bioinformatics. Computational Systems Bioinformatics Conference: latest articles

A grammar based methodology for structural motif finding in ncRNA database search.
Daniel J. Quest, W. Tapprich, H. Ali
In recent years, sequence database searching has been conducted through local alignment heuristics, pattern matching, and comparison of short statistically significant patterns. While these approaches have unlocked many clues as to sequence relationships, they are limited in that they do not provide context-sensitive searching capabilities (e.g., considering pseudoknots, protein binding positions, and complementary base pairs). Stochastic grammars (hidden Markov models, HMMs, and stochastic context-free grammars, SCFGs) do allow for flexibility in terms of local context, but that context comes at the cost of increased computational complexity. In this paper we introduce a new grammar-based method for searching for RNA motifs that exist within a conserved RNA structure. Our method constrains computational complexity by using a chain of topology elements. Through a case study we present the algorithmic approach and benchmark it against traditional methods.
Computational systems bioinformatics. Computational Systems Bioinformatics Conference, volume 6 1, pages 215-25. Published 2007-01-01. Journal Article. DOI: https://doi.org/10.1142/9781860948732_0024. Source: Semanticscholar, paper ID 64007462.
Citations: 3
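The abstract above notes that plain pattern matching cannot express context such as complementary base pairs. The sketch below illustrates that one constraint only: it scans an RNA string for stem-loops whose two arms pair up. It is not the authors' grammar or their chain of topology elements; the function name, the default stem and loop sizes, and the decision to accept G-U wobble pairs are all assumptions made for illustration.

```python
# Watson-Crick pairs plus the G-U wobble pair (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def find_hairpins(seq, stem_len=3, min_loop=3, max_loop=8):
    """Return (start, loop_len) for each stem-loop whose two arms base-pair."""
    hits = []
    n = len(seq)
    for start in range(n):
        for loop_len in range(min_loop, max_loop + 1):
            end = start + 2 * stem_len + loop_len  # one past the 3' arm
            if end > n:
                break  # longer loops will not fit either
            left = seq[start:start + stem_len]
            right = seq[end - stem_len:end]
            # Position i of the 5' arm pairs with the mirrored position of the 3' arm.
            if all((left[i], right[stem_len - 1 - i]) in PAIRS for i in range(stem_len)):
                hits.append((start, loop_len))
    return hits
```

Calling `find_hairpins("GGGAAAACCC")` reports a single hit at position 0 with a 4-base loop. A regular expression cannot state "the 3' arm must be the reverse complement of whatever the 5' arm was", which is the kind of dependency that motivates grammar-based search.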
Composite motifs integrating multiple protein structures increase sensitivity for function prediction.
B. Chen, D. Bryant, Amanda E. Cruess, Joseph H Bylund, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki
The study of disease often hinges on the biological function of proteins, but determining protein function is a difficult experimental process. To minimize duplicated effort, algorithms for function prediction seek characteristics indicative of possible protein function. One approach is to identify substructural matches of geometric and chemical similarity between motifs representing known active sites and target protein structures with unknown function. In earlier work, statistically significant matches of certain effective motifs have identified functionally related active sites. Effective motifs must be carefully designed to maintain similarity to functionally related sites (sensitivity) and avoid incidental similarities to functionally unrelated protein geometry (specificity). Existing motif design techniques use the geometry of a single protein structure. Poor selection of this structure can limit motif effectiveness if the selected functional site lacks similarity to functionally related sites. To address this problem, this paper presents composite motifs, which combine structures of functionally related active sites to potentially increase sensitivity. Our experimentation compares the effectiveness of composite motifs with simple motifs designed from single protein structures. On six distinct families of functionally related proteins, leave-one-out testing showed that composite motifs had sensitivity comparable to the most sensitive of all simple motifs and specificity comparable to the average simple motif. On our data set, we observed that composite motifs simultaneously capture variations in active site conformation, diminish the problem of selecting motif structures, and enable the fusion of protein structures from diverse data sources.
Computational systems bioinformatics. Computational Systems Bioinformatics Conference, volume 6 1, pages 343-55. Published 2007-01-01. Journal Article. DOI: https://doi.org/10.1142/9781860948732_0035. Source: Semanticscholar, paper ID 64007752.
Citations: 7
Using directed information to build biologically relevant influence networks.
Arvind Rao, Alfred O Hero, David J States, James Douglas Engel

The systematic inference of biologically relevant influence networks remains a challenging problem in computational biology. Even though the availability of high-throughput data has enabled the use of probabilistic models to infer the plausible structure of such networks, their true interpretation of the biology of the process is questionable. In this work, we propose a network inference methodology, based on the directed information (DTI) criterion, which incorporates the biology of transcription within the framework, so as to enable experimentally verifiable inference. We use publicly available embryonic kidney and T-cell microarray datasets to demonstrate our results. We present two variants of network inference via DTI (supervised and unsupervised) and the inferred networks relevant to mammalian nephrogenesis as well as T-cell activation. We demonstrate the conformity of the obtained interactions with literature as well as comparison with the coefficient of determination (CoD) method. Apart from network inference, the proposed framework enables the exploration of specific interactions, not just those revealed by data.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference, pages 145-56. Published 2007-01-01. Journal Article. Source: Semanticscholar, paper ID 27061634.
Citations: 0
Deconvoluting the BAC-gene relationships using a physical map.
Yonghui Wu, Lan Liu, Timothy J Close, Stefano Lonardi

Motivation: The deconvolution of the relationships between BAC clones and genes is a crucial step in the selective sequencing of the regions of interest in a genome. It usually requires combinatorial pooling of unique probes obtained from the genes (unigenes), and the screening of the BAC library using the pools in a hybridization experiment. Since several probes can hybridize to the same BAC, the pooling design has to be able to handle a large number of positives for the deconvolution to be achievable. As a consequence, smaller pools need to be designed, which in turn increases the number of hybridization experiments, possibly making the entire protocol unfeasible.

Results: We propose a new algorithm that is capable of producing high-accuracy deconvolution even in the presence of a weak pooling design, i.e., when pools are rather large. The algorithm compensates for the decrease of information in the hybridization data by taking advantage of a physical map of the BAC clones. We show that the right combination of combinatorial pooling and our algorithm not only dramatically reduces the number of pools required, but also successfully deconvolutes the BAC-gene relationships with almost perfect accuracy.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference, pages 203-14. Published 2007-01-01. Journal Article. Source: Semanticscholar, paper ID 27061639.
Citations: 0
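To make the pooling problem in the abstract above concrete, here is a minimal sketch of the naive decoding step that a pooling design has to support: a probe remains a candidate for a BAC only if every pool containing that probe hybridized to the BAC. This is a textbook-style baseline, not the authors' algorithm, and it does not use a physical map. The data layout and the names `pools`, `positive_pools`, and `naive_decode` are assumptions.

```python
def naive_decode(pools, positive_pools):
    """Naive pooling decoder.

    pools: pool id -> set of probes placed in that pool.
    positive_pools: BAC id -> set of pool ids that hybridized to that BAC.
    Returns BAC id -> set of probes consistent with the observed positives.
    """
    probes = set().union(*pools.values())
    candidates = {}
    for bac, positives in positive_pools.items():
        candidates[bac] = {
            p for p in probes
            # Keep p only if every pool that contains p was positive for this BAC.
            if all(pid in positives for pid, members in pools.items() if p in members)
        }
    return candidates
```

With three pools `{"P1": {"g1", "g2"}, "P2": {"g1", "g3"}, "P3": {"g2", "g3"}}`, a BAC positive in P1 and P2 decodes cleanly to `g1`, while a BAC positive in all three pools leaves `g1`, `g2`, and `g3` as candidates. That ambiguity under many positives is the situation the abstract describes, where extra information is needed to resolve the assignment.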
A grammar based methodology for structural motif finding in ncRNA database search.
Daniel Quest, William Tapprich, Hesham Ali

In recent years, sequence database searching has been conducted through local alignment heuristics, pattern matching, and comparison of short statistically significant patterns. While these approaches have unlocked many clues as to sequence relationships, they are limited in that they do not provide context-sensitive searching capabilities (e.g., considering pseudoknots, protein binding positions, and complementary base pairs). Stochastic grammars (hidden Markov models, HMMs, and stochastic context-free grammars, SCFGs) do allow for flexibility in terms of local context, but that context comes at the cost of increased computational complexity. In this paper we introduce a new grammar-based method for searching for RNA motifs that exist within a conserved RNA structure. Our method constrains computational complexity by using a chain of topology elements. Through a case study we present the algorithmic approach and benchmark it against traditional methods.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference, pages 215-25. Published 2007-01-01. Journal Article. Source: Semanticscholar, paper ID 27061640.
Citations: 0
An algorithmic approach to automated high-throughput identification of disulfide connectivity in proteins using tandem mass spectrometry.
Timothy Lee, Rahul Singh, T. Yen, B. Macher
Knowledge of the pattern of disulfide linkages in a protein leads to a better understanding of its tertiary structure and biological function. At the state of the art, liquid chromatography/electrospray ionization-tandem mass spectrometry (LC/ESI-MS/MS) can produce spectra of the peptides in a protein that are putatively joined by a disulfide bond. In this setting, efficient algorithms are required for matching the theoretical mass spaces of all possible bonded peptide fragments to the experimentally derived spectra to determine the number and location of the disulfide bonds. The algorithmic solution must also account for issues associated with interpreting experimental data from mass spectrometry, such as noise, isotopic variation, neutral loss, and charge state uncertainty. In this paper, we propose an algorithmic approach to high-throughput disulfide bond identification using data from mass spectrometry that addresses all the aforementioned issues in a unified framework. The complexity of the proposed solution is of the order of the input spectra. The efficacy and efficiency of the method were validated using experimental data derived from proteins with diverse disulfide linkage patterns.
Computational systems bioinformatics. Computational Systems Bioinformatics Conference, volume 6 1, pages 41-51. Published 2007-01-01. Journal Article. DOI: https://doi.org/10.1142/9781860948732_0009. Source: Semanticscholar, paper ID 64007128.
Citations: 10
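The mass-matching step described in the abstract above can be sketched in its simplest form: enumerate pairs of cysteine-containing peptides, compute the mass each pair would have if joined by a disulfide bond, and keep pairs that fall within a tolerance of an observed mass. This is a brute-force illustration, not the authors' method. It assumes neutral monoisotopic masses and inter-peptide bonds only, and it ignores noise, isotopic variation, neutral loss, and charge state. The peptide names, masses, and tolerance are hypothetical.

```python
from itertools import combinations

H_MASS = 1.00783  # monoisotopic mass of a hydrogen atom, in Da


def bonded_pair_matches(peptide_masses, observed_mass, tol=0.5):
    """Return peptide-name pairs whose disulfide-bonded mass is within tol of observed_mass.

    peptide_masses: peptide name -> neutral monoisotopic mass in Da (reduced form).
    """
    matches = []
    for (a, mass_a), (b, mass_b) in combinations(sorted(peptide_masses.items()), 2):
        # Forming one S-S bond removes the two thiol hydrogens.
        bonded = mass_a + mass_b - 2 * H_MASS
        if abs(bonded - observed_mass) <= tol:
            matches.append((a, b))
    return matches
```

For hypothetical peptides of 1000.0, 1500.0, and 800.0 Da, an observed mass of 2497.98 Da matches only the first two. Enumerating all pairs is quadratic in the number of peptides, which is why the abstract stresses the need for efficient matching.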
Exact and heuristic algorithms for weighted cluster editing.
S. Rahmann, T. Wittkop, J. Baumbach, Marcel Martin, A. Truß, Sebastian Böcker
Clustering objects according to given similarity or distance values is a ubiquitous problem in computational biology with diverse applications, e.g., in defining families of orthologous genes, or in the analysis of microarray experiments. While there exists a plenitude of methods, many of them produce clusterings that can be further improved. "Cleaning up" initial clusterings can be formalized as projecting a graph on the space of transitive graphs; it is also known as the cluster editing or cluster partitioning problem in the literature. In contrast to previous work on cluster editing, we allow arbitrary weights on the similarity graph. To solve the so-defined weighted transitive graph projection problem, we present (1) the first exact fixed-parameter algorithm, (2) a polynomial-time greedy algorithm that returns the optimal result on a well-defined subset of "close-to-transitive" graphs and works heuristically on other graphs, and (3) a fast heuristic that uses ideas similar to those from the Fruchterman-Reingold graph layout algorithm. We compare quality and running times of these algorithms on both artificial graphs and protein similarity graphs derived from the 66 organisms of the COG dataset.
Computational systems bioinformatics. Computational Systems Bioinformatics Conference, volume 6 1, pages 391-401. Published 2007-01-01. Journal Article. DOI: https://doi.org/10.1142/9781860948732_0040. Source: Semanticscholar, paper ID 64007620.
Citations: 69
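The objective behind weighted cluster editing, as described in the abstract above, can be shown with a small cost function. The sketch assumes the common convention that a positive similarity means an edge is present and a negative one means it is absent, so that a clustering pays the weight of every edge it deletes between clusters and the absolute weight of every edge it inserts within a cluster. It only scores a given clustering; it is none of the three algorithms the paper presents, and the names are hypothetical.

```python
def editing_cost(similarity, clusters):
    """Cost of turning the weighted similarity graph into the given clustering.

    similarity: {(u, v): weight}; weight > 0 means an edge, weight < 0 a non-edge.
    clusters: list of disjoint sets covering all nodes.
    """
    label = {node: i for i, cluster in enumerate(clusters) for node in cluster}
    cost = 0.0
    for (u, v), weight in similarity.items():
        same_cluster = label[u] == label[v]
        if weight > 0 and not same_cluster:
            cost += weight    # delete an existing edge that crosses clusters
        elif weight < 0 and same_cluster:
            cost += -weight   # insert a missing edge inside a cluster
    return cost
```

For the triangle `{("a", "b"): 2.0, ("b", "c"): 1.0, ("a", "c"): -0.5}`, putting all three nodes in one cluster costs 0.5 (one insertion), while splitting off `c` costs 1.0 (one deletion). The problem is to find the clustering that minimizes this cost.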
Modeling species-genes data for efficient phylogenetic inference.
Wenyuan Li, Y. Liu
In recent years, biclique methods have been proposed to construct phylogenetic trees. One of the key steps of these methods is to find complete sub-matrices (without missing entries) from a species-genes data matrix. To enumerate all complete sub-matrices, (17) described an exact algorithm, whose running time is exponential. Furthermore, it generates a large number of complete sub-matrices, many of which may not be used for tree reconstruction. Further investigating and understanding the characteristics of species-genes data may be helpful for discovering complete sub-matrices. Therefore, in this paper, we focus on quantitatively studying and understanding the characteristics of species-genes data, which can be used to guide new algorithm design for efficient phylogenetic inference. A mathematical model is constructed to simulate the real species-genes data. The results indicate that sequence-availability probability distributions follow a power law, which leads to the skewness and sparseness of the real species-genes data. Moreover, a special structure, called "ladder structure", is discovered in the real species-genes data. This ladder structure is used to identify complete sub-matrices, and more importantly, to reveal overlapping relationships among complete sub-matrices. To discover the distinct ladder structure in real species-genes data, we propose an efficient evolutionary dynamical system, called "generalized replicator dynamics". Two species-genes data sets from green plants are used to illustrate the effectiveness of our model. Empirical study has shown that our model is effective and efficient in understanding species-genes data for phylogenetic inference.
Computational systems bioinformatics. Computational Systems Bioinformatics Conference, volume 6 1, pages 429-40. Published 2007-01-01. Journal Article. DOI: https://doi.org/10.1142/9781860948732_0043. Source: Semanticscholar, paper ID 64007901.
Citations: 1
Rule-based human gene normalization in biomedical text with confidence estimation.
W. Lau, Calvin A. Johnson, Kevin Becker
The ability to identify gene mentions in text and normalize them to the proper unique identifiers is crucial for "down-stream" text mining applications in bioinformatics. We have developed a rule-based algorithm that divides the normalization task into two steps. The first step includes pattern matching for gene symbols and an approximate term searching technique for gene names. Next, the algorithm measures several features based on morphological, statistical, and contextual information to estimate the level of confidence that the correct identifier is selected for a potential mention. Uniqueness, inverse distance, and coverage are three novel features we quantified. The algorithm was evaluated against the BioCreAtIvE datasets. The feature weights were tuned by the Nelder-Mead simplex method. An F-score of 0.7622 and an AUC (area under the recall-precision curve) of 0.7461 were achieved on the test data using the set of weights optimized to the training data.
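As a rough illustration of the quantities this abstract mentions, the sketch below combines feature values into a confidence score and computes an F-score from match counts. The linear weighted sum, the function names `confidence` and `f_score`, and all numeric values are assumptions made for illustration; the abstract does not specify how the features are combined.

```python
def confidence(features, weights):
    """Combine feature values into one score.

    Assumption: a plain weighted sum. The abstract only says that
    feature weights were tuned, not how the features are combined.
    """
    return sum(weights[name] * value for name, value in features.items())


def f_score(true_pos, false_pos, false_neg):
    """Balanced F-score: harmonic mean of precision and recall."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical feature values and weights for one candidate gene mention.
example_features = {"uniqueness": 1.0, "inverse_distance": 0.5, "coverage": 0.5}
example_weights = {"uniqueness": 0.5, "inverse_distance": 0.25, "coverage": 0.25}
example_confidence = confidence(example_features, example_weights)
```

In a setup like this, a simplex optimizer such as Nelder-Mead would search over the weight values to maximize the F-score on training data; that search is omitted here.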
Journal: Computational systems bioinformatics. Computational Systems Bioinformatics Conference; Volume: 6 1; Pages: 371-9; Published: 2007-01-01; DOI: https://doi.org/10.1142/9781860948732_0037
Citations: 11
fRMSDPred: predicting local RMSD between structural fragments using sequence information.
Huzefa Rangwala, George Karypis

The effectiveness of comparative modeling approaches for protein structure prediction can be substantially improved by incorporating predicted structural information in the initial sequence-structure alignment. Motivated by the approaches used to align protein structures, this paper focuses on developing machine learning approaches for estimating the RMSD value of a pair of protein fragments. These estimated fragment-level RMSD values can be used to construct the alignment, assess the quality of an alignment, and identify high-quality alignment segments. We present algorithms to solve this fragment-level RMSD prediction problem using a supervised learning framework based on support vector regression and classification that incorporates protein profiles, predicted secondary structure, effective information encoding schemes, and novel second-order pairwise exponential kernel functions. Our comprehensive empirical study shows superior results compared to the profile-to-profile scoring schemes.
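For readers unfamiliar with the target quantity, the sketch below computes the RMSD between two equal-length fragments given as lists of 3D coordinates. It assumes the fragments are already superimposed; in practice, structural RMSD is usually reported after an optimal rigid-body superposition, which is omitted here. The function name `fragment_rmsd` and the coordinates are hypothetical, and the code illustrates only the value being predicted, not the prediction method described in the abstract.

```python
import math


def fragment_rmsd(frag_a, frag_b):
    """Root-mean-square deviation between paired 3D points.

    Assumption: both fragments are already superimposed, so no
    rotation or translation is applied before measuring distances.
    """
    if len(frag_a) != len(frag_b) or not frag_a:
        raise ValueError("fragments must be non-empty and of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(frag_a, frag_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(frag_a))


# Two hypothetical two-residue fragments (one point per residue).
frag_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
frag_2 = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
example_rmsd = fragment_rmsd(frag_1, frag_2)
```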

Journal: Computational systems bioinformatics. Computational Systems Bioinformatics Conference; Pages: 311-22; Published: 2007-01-01
Citations: 0