
Proceedings. International Conference on Intelligent Systems for Molecular Biology: Latest Articles

Reducing Mass Degeneracy in SAR by MS by Stable Isotopic Labeling
Pub Date : 2000-08-19 DOI: 10.1089/106652701300099056
C. Bailey-Kellogg, J. Kelley, Clifford Stein, B. Donald
Mass spectrometry (MS) promises to be an invaluable tool for functional genomics, by supporting low-cost, high-throughput experiments. However, large-scale MS faces the potential problem of mass degeneracy: indistinguishable masses for multiple biopolymer fragments (e.g., from a limited proteolytic digest). This paper studies the tasks of planning and interpreting MS experiments that use selective isotopic labeling, thereby substantially reducing potential mass degeneracy. Our algorithms support an experimental-computational protocol called structure-activity relation by mass spectrometry (SAR by MS) for elucidating the function of protein-DNA and protein-protein complexes. SAR by MS enzymatically cleaves a crosslinked complex and analyzes the resulting mass spectrum for mass peaks of hypothesized fragments. Depending on binding mode, some cleavage sites will be shielded; the absence of anticipated peaks implicates corresponding fragments as either part of the interaction region or inaccessible due to conformational change upon binding. Thus, different mass spectra provide evidence for different structure-activity relations. We address combinatorial and algorithmic questions in the areas of data analysis (constraining binding mode based on mass signature) and experiment planning (determining an isotopic labeling strategy to reduce mass degeneracy and aid data analysis). We explore the computational complexity of these problems, obtaining upper and lower bounds. We report experimental results from implementations of our algorithms.
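The notion of mass degeneracy described above, and how selective labeling can break it, can be illustrated with a toy sketch. This is not the authors' algorithm: the function names, the nominal integer residue masses, and the uniform `shift` of one mass unit per labeled residue are all simplifying assumptions made for illustration.

```python
from collections import defaultdict

# Nominal (integer) residue masses for a few amino acids; Leu and Ile are isobaric.
NOMINAL = {"G": 57, "A": 71, "S": 87, "L": 113, "I": 113, "K": 128}
WATER = 18  # added once per peptide fragment

def fragment_mass(frag, labeled=frozenset(), shift=1):
    """Mass of a fragment; each residue type in `labeled` is heavier by `shift` (hypothetical)."""
    return WATER + sum(NOMINAL[r] + (shift if r in labeled else 0) for r in frag)

def degenerate_groups(frags, labeled=frozenset(), shift=1):
    """Return groups of fragments whose masses coincide exactly."""
    by_mass = defaultdict(list)
    for f in frags:
        by_mass[fragment_mass(f, labeled, shift)].append(f)
    return sorted(sorted(g) for g in by_mass.values() if len(g) > 1)

fragments = ["GLK", "GIK", "ASK"]
unlabeled = degenerate_groups(fragments)               # GLK and GIK collide
leu_labeled = degenerate_groups(fragments, {"L"})      # labeling Leu separates them
```

Choosing which residue types to label so that as few groups as possible remain is the experiment-planning side of the problem; the sketch only evaluates one candidate labeling.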
Citations: 3
Protein family classification using sparse Markov transducers.
E Eskin, W N Grundy, Y Singer

In this paper we present a method for classifying proteins into families using sparse Markov transducers (SMTs). Sparse Markov transducers, similar to probabilistic suffix trees, estimate a probability distribution conditioned on an input sequence. SMTs generalize probabilistic suffix trees by allowing for wildcards in the conditioning sequences. Because substitutions of amino acids are common in protein families, incorporating wildcards into the model significantly improves classification performance. We present two models for building protein family classifiers using SMTs. We also present efficient data structures to improve the memory usage of the models. We evaluate SMTs by building protein family classifiers using the Pfam database and compare our results to previously published results.
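To make the role of wildcards in a conditioning sequence concrete, here is a minimal sketch that estimates the distribution of the next symbol given a context pattern in which `*` matches any symbol. It is only a counting illustration under assumed names (`matches`, `conditional_counts`, `conditional_prob`); it is not the SMT model itself, which the abstract does not specify in detail.

```python
from collections import Counter

def matches(pattern, context):
    """True if `context` fits `pattern`, where '*' in the pattern matches any symbol."""
    return len(pattern) == len(context) and all(
        p == "*" or p == c for p, c in zip(pattern, context)
    )

def conditional_counts(sequences, pattern):
    """Count which symbol follows every window that matches `pattern`."""
    counts = Counter()
    k = len(pattern)
    for seq in sequences:
        for i in range(k, len(seq)):
            if matches(pattern, seq[i - k:i]):
                counts[seq[i]] += 1
    return counts

def conditional_prob(sequences, pattern, symbol):
    """Maximum-likelihood estimate of P(symbol | pattern); 0.0 if the pattern never occurs."""
    counts = conditional_counts(sequences, pattern)
    total = sum(counts.values())
    return counts[symbol] / total if total else 0.0

family = ["ACDE", "AGDE", "ACDF"]
exact = conditional_prob(family, "ACD", "E")      # uses only the two ACD contexts
wildcard = conditional_prob(family, "A*D", "E")   # also pools the substituted AGD context
```

The wildcard pattern pools evidence across contexts that differ by a substitution, which is the intuition the abstract gives for why wildcards help on protein families.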

Pub Date : 2000-01-01
Citations: 0
A practical algorithm for optimal inference of haplotypes from diploid populations.
D Gusfield

The next phase of human genomics will involve large-scale screens of populations for significant DNA polymorphisms, notably single nucleotide polymorphisms (SNPs). Dense human SNP maps are currently under construction. However, the utility of those maps and screens will be limited by the fact that humans are diploid and that it is presently difficult to get separate data on the two "copies". Hence genotype (blended) SNP data will be collected, and the desired haplotype (partitioned) data must then be (partially) inferred. A particular non-deterministic inference algorithm was proposed and studied before SNP data were available, and has more recently been applied extensively to study the first available SNP data. In this paper, we consider whether we can obtain an efficient, deterministic variant of that method to optimize the obtained inferences. Although we have shown elsewhere that the optimization problem is NP-hard, we present here a practical approach based on (integer) linear programming. The method either returns the optimal answer together with a declaration that it is optimal, or declares that it has failed to find the optimum. The approach works quickly and correctly, finding the optimum on all simulated data tested, data that are expected to be more demanding than realistic biological data.
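The relationship between blended genotype data and the haplotype pair behind it can be sketched as follows. The encoding is an assumption made for this example (per site, 0 or 1 means homozygous for that allele and 2 means heterozygous), as are the function names; the sketch shows a single inference step, not the integer-programming method of the paper.

```python
def explains(h1, h2, genotype):
    """True if the haplotype pair (h1, h2) blends into `genotype`."""
    return all(
        (g == 2 and a != b) or (g != 2 and a == b == g)
        for a, b, g in zip(h1, h2, genotype)
    )

def complement(genotype, hap):
    """Given one haplotype, return the other haplotype that would complete
    the genotype, or None if `hap` is incompatible with it."""
    other = []
    for g, a in zip(genotype, hap):
        if g == 2:
            other.append(1 - a)      # heterozygous site: the partner carries the other allele
        elif a == g:
            other.append(g)          # homozygous site: both copies agree
        else:
            return None              # known haplotype contradicts the genotype
    return tuple(other)

genotype = (0, 2, 1, 2)
known = (0, 1, 1, 0)
inferred = complement(genotype, known)   # (0, 0, 1, 1)
```

A genotype with k heterozygous sites is compatible with 2^(k-1) haplotype pairs, which is why the order in which such inference steps are applied matters and why optimizing over them is hard.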

Pub Date : 2000-01-01
Citations: 0
Analysis of yeast's ORF upstream regions by parallel processing, microarrays, and computational methods.
S Hampson, P Baldi, D Kibler, S B Sandmeyer
Pub Date : 2000-01-01
Citations: 0
A pragmatic information extraction strategy for gathering data on genetic interactions.
D Proux, F Rechenmann, L Julliard

We present a pragmatic strategy for information extraction from biological texts. Since the emergence of the information extraction field, techniques have evolved, become more robust, and proved their efficiency in specific domains. We use a combination of existing linguistic and knowledge processing tools to automatically extract information about gene interactions from the literature. Our ultimate goal is to build a network of gene interactions. The methodologies used and the current results are discussed in this paper.

Pub Date : 2000-01-01
Citations: 0
Sequence database search using jumping alignments.
R Spang, M Rehmsmeier, J Stoye

We describe a new algorithm for amino acid sequence classification and the detection of remote homologues. The rationale is to exploit both vertical and horizontal information of a multiple alignment in a well-balanced manner. This is in contrast to established methods like profiles and hidden Markov models, which focus on vertical information as they model the columns of the alignment independently. In our setting, we want to select from a given database of "candidate sequences" those proteins that belong to a given superfamily. In order to do so, each candidate sequence is separately tested against a multiple alignment of the known members of the superfamily by means of a new jumping alignment algorithm. This algorithm is an extension of the Smith-Waterman algorithm and computes a local alignment of a single sequence and a multiple alignment. In contrast to traditional methods, however, this alignment is not based on a summary of the individual columns of the multiple alignment. Rather, the candidate sequence at each position is aligned to one sequence of the multiple alignment, called the "reference sequence". In addition, the reference sequence may change within the alignment, while each such jump is penalized. To evaluate the discriminative quality of the jumping alignment algorithm, we compared it to hidden Markov models on a subset of the SCOP database of protein domains. The discriminative quality was assessed by counting the number of false positives that ranked higher than the first true positive (FP-count). For moderate FP-counts above five, the number of successful searches with our method was considerably higher than with hidden Markov models.
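The idea of aligning each position to one reference row and paying a penalty for switching rows can be sketched with a deliberately simplified dynamic program. This version is ungapped (it only extends along diagonals), uses a plain match/mismatch score, and assumes all alignment rows have equal length without gap characters; the published algorithm extends Smith-Waterman and handles gaps. The function name and default scores are assumptions for illustration.

```python
def jumping_score(candidate, rows, match=2, mismatch=-1, jump=1):
    """Best local, ungapped score of `candidate` against the rows of a multiple
    alignment, where the reference row may change at a cost of `jump` per switch."""
    n, m, k = len(candidate), len(rows[0]), len(rows)
    best = 0
    # prev[j][r]: best score of an alignment ending at (previous candidate
    # position, column j) while using row r as the reference sequence.
    prev = [[0] * k for _ in range(m + 1)]
    for i in range(1, n + 1):
        cur = [[0] * k for _ in range(m + 1)]
        for j in range(1, m + 1):
            diag = prev[j - 1]
            top = max(diag)  # best predecessor over all rows
            for r, row in enumerate(rows):
                s = match if candidate[i - 1] == row[j - 1] else mismatch
                # Stay on row r, jump in from the best row, or start fresh (0).
                carry = max(0, diag[r], top - jump)
                cur[j][r] = max(0, s + carry)
                best = max(best, cur[j][r])
        prev = cur
    return best

rows = ["AAAA", "CCCC"]
with_jumps = jumping_score("AACC", rows, jump=1)       # follows row 0, then jumps to row 1
without_jumps = jumping_score("AACC", rows, jump=100)  # a prohibitive penalty forbids jumps
```

With a very large jump penalty the score reduces to the best ungapped local alignment against any single row, which makes the penalty a knob between single-sequence search and free mixing of rows.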

Pub Date : 2000-01-01
Citations: 0
Finding regulatory elements using joint likelihoods for sequence and expression profile data.
I Holmes, W J Bruno

A recent, popular method of finding promoter sequences is to look for conserved motifs upstream of genes clustered on the basis of expression data. This method presupposes that the clustering is correct. Theoretically, one should be better able to find promoter sequences and create more relevant gene clusters by taking a unified approach to these two problems. We present a likelihood function for a "sequence-expression" model giving a joint likelihood for a promoter sequence and its corresponding expression levels. An algorithm to estimate sequence-expression model parameters using Gibbs sampling and Expectation/Maximization is described. A program, called kimono, that implements this algorithm has been developed: the source code is freely available on the Internet.
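A joint likelihood of the kind described above can be sketched by adding a sequence term and an expression term in log space. The specific choices here are assumptions made for illustration and are not taken from the kimono program: the sequence term is a position weight matrix over a candidate site, the expression term is an independent Gaussian around a cluster mean profile, and the two are treated as independent given the cluster.

```python
import math

def log_motif_likelihood(site, pwm):
    """Log-probability of a candidate site under a position weight matrix
    (`pwm[i]` maps each base to its probability at position i)."""
    return sum(math.log(pwm[i][base]) for i, base in enumerate(site))

def log_gaussian(x, mean, sd):
    """Log-density of a normal distribution with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def joint_log_likelihood(site, expression, pwm, cluster_means, sd):
    """Sequence term plus expression term, assuming independence given the cluster."""
    seq_term = log_motif_likelihood(site, pwm)
    expr_term = sum(log_gaussian(x, m, sd) for x, m in zip(expression, cluster_means))
    return seq_term + expr_term

pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
score = joint_log_likelihood("AG", [1.0, -0.5], pwm, cluster_means=[0.8, -0.4], sd=1.0)
```

Parameter estimation would then alternate between resampling site positions and cluster assignments and updating `pwm` and `cluster_means`, which is where Gibbs sampling and Expectation/Maximization enter in the abstract's description.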

Pub Date : 2000-01-01
Citations: 0
The conserved exon method for gene finding.
V Bafna, D H Huson

A new approach to gene finding, called the "Conserved Exon Method" (CEM), is introduced. It is based on the idea of looking for conserved protein sequences by comparing pairs of DNA sequences, identifying putative exon pairs based on conserved regions and splice junction signals, and then chaining pairs of putative exons together. It simultaneously predicts gene structures in both human and mouse genomic sequences (or in other pairs of sequences at the appropriate evolutionary distance). Experimental results indicate the potential usefulness of this approach.
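The chaining step can be sketched as a highest-scoring chain problem: each putative exon pair occupies an interval in both sequences, and a chain must be collinear in both. The tuple layout `(start1, end1, start2, end2, score)`, the function name, and the quadratic-time dynamic program below are assumptions for illustration; the abstract does not describe how CEM scores or chains pairs.

```python
def best_chain(pairs):
    """Return (total_score, chain) for the highest-scoring chain of exon pairs
    in which every pair lies strictly after its predecessor in both sequences."""
    if not pairs:
        return 0.0, []
    pairs = sorted(pairs, key=lambda p: (p[0], p[2]))
    best = [p[4] for p in pairs]   # best chain score ending at each pair
    back = [-1] * len(pairs)       # predecessor index, -1 if the chain starts here
    for j, q in enumerate(pairs):
        for i in range(j):
            p = pairs[i]
            if p[1] < q[0] and p[3] < q[2] and best[i] + q[4] > best[j]:
                best[j] = best[i] + q[4]
                back[j] = i
    end = max(range(len(pairs)), key=lambda idx: best[idx])
    chain, idx = [], end
    while idx != -1:
        chain.append(pairs[idx])
        idx = back[idx]
    return best[end], chain[::-1]

candidates = [
    (0, 10, 0, 12, 5.0),
    (20, 30, 25, 35, 4.0),
    (15, 40, 5, 8, 6.0),    # overlaps the second pair in sequence 1
    (50, 60, 50, 58, 3.0),
]
total, chain = best_chain(candidates)
```

Sorting by start position guarantees that every valid predecessor of a pair has already been processed when that pair is reached.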

Pub Date : 2000-01-01
Citations: 0
A probabilistic learning approach to whole-genome operon prediction.
M Craven, D Page, J Shavlik, J Bockhorst, J Glasner

We present a computational approach to predicting operons in the genomes of prokaryotic organisms. Our approach uses machine learning methods to induce predictive models for this task from a rich variety of data types including sequence data, gene expression data, and functional annotations associated with genes. We use multiple learned models that individually predict promoters, terminators and operons themselves. A key part of our approach is a dynamic programming method that uses our predictions to map every known and putative gene in a given genome into its most probable operon. We evaluate our approach using data from the E. coli K-12 genome.
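The dynamic programming step, which assigns every gene to its most probable operon, can be sketched as an optimal segmentation of an ordered gene list into consecutive runs. The sketch assumes a hypothetical `segment_score(i, j)` callback returning the log-probability that genes `i` through `j - 1` form one operon; in the paper that score would come from the learned models, whose details the abstract does not give.

```python
def best_segmentation(n, segment_score):
    """Split genes 0..n-1 into consecutive segments maximizing the summed score.
    Returns (total_score, [(start, end), ...]) with half-open segment bounds."""
    best = [0.0] + [float("-inf")] * n   # best[j]: best score for the first j genes
    back = [0] * (n + 1)                 # back[j]: start of the last segment ending at j
    for j in range(1, n + 1):
        for i in range(j):
            s = best[i] + segment_score(i, j)
            if s > best[j]:
                best[j], back[j] = s, i
    segments, j = [], n
    while j > 0:
        segments.append((back[j], j))
        j = back[j]
    return best[n], segments[::-1]

# Hypothetical log-probabilities for candidate operons over four genes.
table = {(0, 2): -0.5, (2, 4): -0.7, (0, 4): -2.0,
         (0, 1): -1.0, (1, 2): -1.0, (2, 3): -1.0, (3, 4): -1.0}
total, operons = best_segmentation(4, lambda i, j: table.get((i, j), -5.0))
```

Because every gene falls into exactly one segment, the result maps each known or putative gene to a single predicted operon, with single-gene segments standing for genes transcribed alone.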

Pub Date : 2000-01-01
Citations: 0
Regulatory element detection using a probabilistic segmentation model.
H J Bussemaker, H Li, E D Siggia

The availability of genome-wide mRNA expression data for organisms whose genome is fully sequenced provides a unique data set from which to decipher how transcription is regulated by the upstream control region of a gene. A new algorithm is presented which decomposes DNA sequence into the most probable "dictionary" of motifs or words. Identification of words is based on a probabilistic segmentation model in which the significance of longer words is deduced from the frequency of shorter words of various lengths. This eliminates the need for a separate set of reference data to define probabilities, and genome-wide applications are therefore possible. For the 6,000 upstream regulatory regions in the yeast genome, the 500 strongest motifs from a dictionary of size 1,200 match at a significance level of 15 standard deviations to a database of cis-regulatory elements. Analysis of sets of genes such as those up-regulated during sporulation reveals many new putative regulatory sites in addition to identifying previously known sites.
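Under a segmentation model, the probability of a sequence is the sum, over every way of cutting it into dictionary words, of the product of the word probabilities. The sketch below computes that sum with a forward recursion; the function name and the toy dictionary are assumptions for illustration, and the step of fitting word probabilities and growing the dictionary is not shown.

```python
def sequence_likelihood(seq, word_probs):
    """Total probability of `seq` summed over all segmentations into dictionary
    words, where each word is drawn independently with probability word_probs[word]."""
    # z[j]: summed probability of all segmentations of the prefix seq[:j]
    z = [1.0] + [0.0] * len(seq)
    for j in range(1, len(seq) + 1):
        for word, p in word_probs.items():
            k = len(word)
            if k <= j and seq[j - k:j] == word:
                z[j] += p * z[j - k]
    return z[-1]

# Toy dictionary: two single letters and one longer word.
words = {"A": 0.5, "T": 0.3, "AT": 0.2}
total = sequence_likelihood("AT", words)   # A|T contributes 0.15, AT contributes 0.2
```

A longer word earns its place in the dictionary when giving it its own probability explains the data better than composing it from shorter words, which is how the model judges significance without external reference data.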

Pub Date : 2000-01-01
Citations: 0