Peptide charge state determination for low-resolution tandem mass spectra.
Aaron A Klammer, Christine C Wu, Michael J MacCoss, William Stafford Noble
Mass spectrometry is a particularly useful technology for the rapid and robust identification of peptides and proteins in complex mixtures. Peptide sequences can be identified by correlating their observed tandem mass spectra (MS/MS) with theoretical spectra of peptides from a sequence database. Unfortunately, to perform this search the charge of the peptide must be known, and current charge-state determination algorithms only discriminate singly- from multiply-charged spectra: distinguishing +2 from +3, for example, is unreliable. Thus, search software is forced to search multiply-charged spectra multiple times. To minimize this inefficiency, we present a support vector machine (SVM) that quickly and reliably classifies multiply-charged spectra as having either a +2 or +3 precursor peptide ion. By classifying multiply-charged spectra, we obtain a 40% reduction in search time while maintaining an average of 99% of peptide and 99% of protein identifications originally obtained from these spectra.
{"title":"Peptide charge state determination for low-resolution tandem mass spectra.","authors":"Aaron A Klammer, Christine C Wu, Michael J MacCoss, William Stafford Noble","doi":"10.1109/csb.2005.44","DOIUrl":"https://doi.org/10.1109/csb.2005.44","url":null,"abstract":"<p><p>Mass spectrometry is a particularly useful technology for the rapid and robust identification of peptides and proteins in complex mixtures. Peptide sequences can be identified by correlating their observed tandem mass spectra (MS/MS) with theoretical spectra of peptides from a sequence database. Unfortunately, to perform this search the charge of the peptide must be known, and current chargestate- determination algorithms only discriminate singlyfrom multiply-charged spectra: distinguishing +2 from +3, for example, is unreliable. Thus, search software is forced to search multiply-charged spectra multiple times. To minimize this inefficiency, we present a support vector machine (SVM) that quickly and reliably classifies multiplycharged spectra as having either a +2 or +3 precursor peptide ion. By classifying multiply-charged spectra, we obtain a 40% reduction in search time while maintaining an average of 99% of peptide and 99% of protein identifications originally obtained from these spectra.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"175-85"},"PeriodicalIF":0.0,"publicationDate":"2005-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2005.44","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25829944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An efficient algorithm for Perfect Phylogeny Haplotyping.
Ravi Vijayasatya, Amar Mukherjee
The Perfect Phylogeny Haplotyping (PPH) problem is one of the many computational approaches to the Haplotype Inference (HI) problem. Though there are many O(nm²) solutions to the PPH problem, the complexity of the PPH problem itself has remained an open question. In this paper, we introduce the FlexTree data structure, which represents all the solutions for a PPH instance, and a row ordering that arranges the genotypes in a more manageable fashion. The column ordering, the FlexTree data structure and the row ordering together make the O(nm) OPPH algorithm possible. We also present results on simulated data which demonstrate that the OPPH algorithm performs quite impressively when compared to the earlier O(nm²) algorithms.
{"title":"An efficient algorithm for Perfect Phylogeny Haplotyping.","authors":"Ravi Vijayasatya, Amar Mukherjee","doi":"10.1109/csb.2005.12","DOIUrl":"https://doi.org/10.1109/csb.2005.12","url":null,"abstract":"<p><p>The Perfect Phylogeny Haplotyping (PPH) problem is one of the many computational approaches to the Haplotype Inference (HI) problem. Though there are many O(nm(2)) solutions to the PPH problem, the complexity of the PPH problem itself has remained an open question. In this paper, We introduce the FlexTree data structure that represents all the solutions for a PPH instance. We also introduce row-ordering that arranges the genotypes in a more manageable fashion. The column ordering, the FlexTree data structure and the row ordering together make the O(nm) OPPH algorithm possible. We also present some results on simulated data which demonstrate that the OPPH algorithm performs quiet impressively when compared to the earlier O(nm(2)) algorithms.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"103-10"},"PeriodicalIF":0.0,"publicationDate":"2005-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2005.12","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25829995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gene teams with relaxed proximity constraint.
Sun Kim, Jeong-Hyeon Choi, Jiong Yang
Functionally related genes co-evolve, probably due to strong selection pressure in evolution, so we expect them to be present together in multiple genomes. Physical proximity among genes, captured by the gene team model, is a very useful concept for discovering functionally related genes in multiple genomes. However, many functionally related gene sets do not preserve physical proximity. In this paper, we generalize the gene team model, which looks for gene clusters in a physically clustered form, to multiple genomes with a relaxed proximity constraint. We propose a novel hybrid pattern model that combines the set and the sequential pattern models; it searches for gene clusters with and/or without the physical proximity constraint. The model is implemented and tested with 97 genomes (120 replicons), and the results demonstrate its usefulness. In particular, analysis of gene clusters in B. subtilis and E. coli shows that our model predicts many experimentally verified operons and functionally related clusters. Our program is fast enough to provide a service on the web at http://platcom.informatics.indiana.edu/platcom/, where users can select any combination of the 97 genomes to predict gene teams.
{"title":"Gene teams with relaxed proximity constraint.","authors":"Sun Kim, Jeong-Hyeon Choi, Jiong Yang","doi":"10.1109/csb.2005.33","DOIUrl":"https://doi.org/10.1109/csb.2005.33","url":null,"abstract":"<p><p>Functionally related genes co-evolve, probably due to the strong selection pressure in evolution. Thus we expect that they are present in multiple genomes. Physical proximity among genes, known as gene team, is a very useful concept to discover functionally related genes in multiple genomes. However, there are also many gene sets that do not preserve physical proximity. In this paper, we generalized the gene team model, that looks for gene clusters in a physically clustered form, to multiple genome cases with relaxed constraint. We propose a novel hybrid pattern model that combines the set and the sequential pattern models. Our model searches for gene clusters with and/or without physical proximity constraint. This model is implemented and tested with 97 genomes (120 replicons). The result was analyzed to show the usefulness of our model. Especially, analysis of gene clusters that belong to B. subtilis and E. coli demonstrated that our model predicted many experimentally verified operons and functionally related clusters. Our program is fast enough to provide a sevice on the web at http://platcom. informatics.indiana.edu/platcom/. Users can select any combination of 97 genomes to predict gene teams.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"44-55"},"PeriodicalIF":0.0,"publicationDate":"2005-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2005.33","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25830571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Computational method for temporal pattern discovery in biomedical genomic databases.
Mohammed I Rafiq, Martin J O'Connor, Amar K Das
With the rapid growth of biomedical research databases, opportunities for scientific inquiry have expanded quickly and led to a demand for computational methods that can extract biologically relevant patterns from vast amounts of data. A significant challenge is identifying temporal relationships among genotypic and clinical (phenotypic) data. Few software tools are available for such pattern matching, and they are not interoperable with existing databases. We are developing and validating a novel software method for temporal pattern discovery in biomedical genomics. In this paper, we present an efficient and flexible query algorithm, TEMF, that extracts statistical patterns from time-oriented relational databases. We show that TEMF, as an extension to our modular temporal querying application Chronus II, can express a wide range of complex temporal aggregations without the need for data processing in a statistical software package. We demonstrate the expressivity of TEMF using example queries from the Stanford HIV Database.
{"title":"Computational method for temporal pattern discovery in biomedical genomic databases.","authors":"Mohammed I Rafiq, Martin J O'Connor, Amar K Das","doi":"10.1109/csb.2005.25","DOIUrl":"https://doi.org/10.1109/csb.2005.25","url":null,"abstract":"<p><p>With the rapid growth of biomedical research databases, opportunities for scientific inquiry have expanded quickly and led to a demand for computational methods that can extract biologically relevant patterns among vast amounts of data. A significant challenge is identifying temporal relationships among genotypic and clinical (phenotypic) data. Few software tools are available for such pattern matching, and they are not interoperable with existing databases. We are developing and validating a novel software method for temporal pattern discovery in biomedical genomics. In this paper, we present an efficient and flexible query algorithm (called TEMF) to extract statistical patterns from time-oriented relational databases. We show that TEMF - as an extension to our modular temporal querying application (Chronus II) - can express a wide range of complex temporal aggregations without the need for data processing in a statistical software package. We show the expressivity of TEMF using example queries from the Stanford HIV Database.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"362-5"},"PeriodicalIF":0.0,"publicationDate":"2005-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2005.25","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25830781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Investigation into biomedical literature classification using support vector machines.
Nalini Polavarapu, Shamkant B Navathe, Ramprasad Ramnarayanan, Abrar ul Haque, Saurav Sahay, Ying Liu
Searching for a specific topic in the PubMed database, one of the most important information resources for the scientific community, presents a big challenge to users. The researcher typically formulates boolean queries and then scans the retrieved records for relevance, which is time-consuming and error prone. We applied support vector machines (SVMs) to the automatic retrieval of PubMed articles related to human genome epidemiological research at the CDC (Centers for Disease Control and Prevention). In this paper, we discuss our investigations into biomedical literature classification and analyze how the choice of keywords, training sets, kernel functions and parameters affects the SVM technique. We report on these factors to show that SVM is a viable technique for automatic classification of biomedical literature into topics of interest such as epidemiology, cancer and birth defects. In all our experiments, we achieved high values of PPV, sensitivity and specificity.
{"title":"Investigation into biomedical literature classification using support vector machines.","authors":"Nalini Polavarapu, Shamkant B Navathe, Ramprasad Ramnarayanan, Abrar ul Haque, Saurav Sahay, Ying Liu","doi":"10.1109/csb.2005.36","DOIUrl":"https://doi.org/10.1109/csb.2005.36","url":null,"abstract":"<p><p>Specific topic search in the PubMed Database, one of the most important information resources for scientific community, presents a big challenge to the users. The researcher typically formulates boolean queries followed by scanning the retrieved records for relevance, which is very time consuming and error prone. We applied Support Vector Machines (SVM) for automatic retrieval of PubMed articles related to Human genome epidemiological research at CDC (Center for disease Control and Prevention). In this paper, we discuss various investigations into biomedical literature classification and analyze the effect of various issues related to the choice of keywords, training sets, kernel functions and parameters for the SVM technique. We report on the various factors above to show that SVM is a viable technique for automatic classification of biomedical literature into topics of interest such as epidemiology, cancer, birth defects etc. In all our experiments, we achieved high values of PPV, sensitivity and specificity.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"366-74"},"PeriodicalIF":0.0,"publicationDate":"2005-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2005.36","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25830782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discover true association rates in multi-protein complex proteomics data sets.
Changyu Shen, Lang Li, Jake Yue Chen
Experimental processes to collect and process proteomics data are increasingly complex, while the computational methods to assess the quality and significance of these data remain unsophisticated. These challenges have led to many biological oversights and computational misconceptions. We developed a complete empirical Bayes model to analyze multi-protein complex (MPC) proteomics data derived from peptide mass spectrometry detections in purified protein complex pull-down experiments. Our model considers not only bait-prey associations, but also the prey-prey associations missed in previous work. Using our model and a yeast MPC proteomics data set, we estimated that there should be an average of 28 true associations per MPC, almost ten times as many as previously estimated. For data sets generated to mimic a real proteome, our model achieved on average 80% sensitivity in detecting true associations, compared with the 3% sensitivity of previous work, while maintaining a comparable false discovery rate of 0.3%.
{"title":"Discover true association rates in multi-protein complex proteomics data sets.","authors":"Changyu Shen, Lang Li, Jake Yue Chen","doi":"10.1109/csb.2005.29","DOIUrl":"https://doi.org/10.1109/csb.2005.29","url":null,"abstract":"<p><p>Experimental processes to collect and process proteomics data are increasingly complex, while the computational methods to assess the quality and significance of these data remain unsophisticated. These challenges have led to many biological oversights and computational misconceptions. We developed a complete empirical Bayes model to analyze multi-protein complex (MPC) proteomics data derived from peptide mass spectrometry detections of purified protein complex pull-down experiments. Our model considers not only bait-prey associations, but also prey-prey associations missed in previous work. Using our model and a yeast MPC proteomics data set, we estimated that there should be an average of 28 true associations per MPC, almost ten times as high as was previously estimated. For data sets generated to mimic a real proteome, our model achieved on average 80% sensitivity in detecting true associations, as compared with the 3% sensitivity in previous work, while maintaining a comparable false discovery rate of 0.3%.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"167-74"},"PeriodicalIF":0.0,"publicationDate":"2005-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2005.29","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25829943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Motif extraction and protein classification.
Vered Kunik, Zach Solan, Shimon Edelman, Eytan Ruppin, David Horn
We present a novel unsupervised method for extracting meaningful motifs from biological sequence data. This de novo motif extraction (MEX) algorithm is data driven, finding motifs that are not necessarily over-represented in the data. Applying MEX to the oxidoreductase class of enzymes, containing approximately 7000 enzyme sequences, yields a relatively small set of motifs. This set spans a motif space that is used for functional classification of the enzymes by an SVM classifier. Classification based on MEX motifs surpasses that of two other SVM-based methods: SVMProt, which analyzes physical-chemical properties of a protein generated from its amino acid sequence, and an SVM applied to a Smith-Waterman distance matrix. Our findings demonstrate that the MEX algorithm extracts relevant motifs, supporting successful sequence-to-function classification.
{"title":"Motif extraction and protein classification.","authors":"Vered Kunik, Zach Solan, Shimon Edelman, Eytan Ruppin, David Horn","doi":"10.1109/csb.2005.39","DOIUrl":"https://doi.org/10.1109/csb.2005.39","url":null,"abstract":"<p><p>We present a novel unsupervised method for extracting meaningful motifs from biological sequence data. This de novo motif extraction (MEX) algorithm is data driven, finding motifs that are not necessarily over-represented in the data. Applying MEX to the oxidoreductases class of enzymes, containing approximately 7000 enzyme sequences, a relatively small set of motifs is obtained. This set spans a motif-space that is used for functional classification of the enzymes by an SVM classifier. The classification based on MEX motifs surpasses that of two other SVM based methods: SVMProt, a method based on the analysis of physical-chemical properties of a protein generated from its sequence of amino acids, and SVM applied to a Smith-Waterman distances matrix. Our findings demonstrate that the MEX algorithm extracts relevant motifs, supporting a successful sequence-to-function classification.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"80-5"},"PeriodicalIF":0.0,"publicationDate":"2005-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2005.39","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25829992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An efficient and accurate algorithm for assigning nuclear Overhauser effect restraints using a rotamer library ensemble and residual dipolar couplings.
Lincong Wang, Bruce Randall Donald
Nuclear Overhauser effect (NOE) distance restraints are the main experimental data from protein nuclear magnetic resonance (NMR) spectroscopy for computing a complete three-dimensional solution structure, including sidechain conformations. In general, NOE restraints must be assigned before they can be used in a structure determination program. NOE assignment is very time-consuming to do manually, challenging to fully automate, and has become a key bottleneck for high-throughput NMR structure determination. The difficulty in automated NOE assignment is ambiguity: there can be tens of possible assignments for an NOE peak based solely on its chemical shifts. Previous automated NOE assignment approaches rely on an ensemble of structures, computed from a subset of all the NOEs, to iteratively filter ambiguous assignments. These algorithms are heuristic in nature, provide no guarantees on solution quality or running time, and are slow in practice. In this paper we present an accurate, efficient NOE assignment algorithm. The algorithm first invokes the algorithm in [30, 29] to compute an accurate backbone structure using only two backbone residual dipolar couplings (RDCs) per residue. It then filters ambiguous NOE assignments by merging an ensemble of intra-residue vectors from a protein rotamer database with internuclear vectors from the computed backbone structure. The rotamer database was built from ultra-high resolution structures (<1.0 Å) in the Protein Data Bank (PDB). The algorithm has been successfully applied to assign more than 1,700 NOE distance restraints with better than 90% accuracy on the protein human ubiquitin using real experimentally recorded NMR data, and it assigns these restraints in less than one second on a single-processor workstation.
{"title":"An efficient and accurate algorithm for assigning nuclear overhauser effect restraints using a rotamer library ensemble and residual dipolar couplings.","authors":"Lincong Wang, Bruce Randall Donald","doi":"10.1109/csb.2005.13","DOIUrl":"https://doi.org/10.1109/csb.2005.13","url":null,"abstract":"<p><p>Nuclear Overhauser effect (NOE) distance restraints are the main experimental data from protein nuclear magnetic resonance (NMR) spectroscopy for computing a complete three dimensional solution structure including sidechain conformations. In general, NOE restraints must be assigned before they can be used in a structure determination program. NOE assignment is very time-consuming to do manually, challenging to fully automate, and has become a key bottleneck for high-throughput NMR structure determination. The difficulty in automated NOE assignment is ambiguity: there can be tens of possible different assignments for an NOE peak based solely on its chemical shifts. Previous automated NOE assignment approaches rely on an ensemble of structures, computed from a subset of all the NOEs, to iteratively filter ambiguous assignments. These algorithms are heuristic in nature, provide no guarantees on solution quality or running time, and are slow in practice. In this paper we present an accurate, efficient NOE assignment algorithm. The algorithm first invokes the algorithm in [30, 29] to compute an accurate backbone structure using only two backbone residual dipolar couplings (RDCs) per residue. The algorithm then filters ambiguous NOE assignments by merging an ensemble of intra-residue vectors from a protein rotamer database, together with internuclear vectors from the computed backbone structure. The protein rotamer database was built from ultra-high resolution structures (<1.0 A) in the Protein Data Bank (PDB). The algorithm has been successfully applied to assign more than 1,700 NOE distance restraints with better than 90% accuracy on the protein human ubiquitin using real experimentally-recorded NMR data. The algorithm assigns these NOE restraints in less than one second on a single-processor workstation.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"189-202"},"PeriodicalIF":0.0,"publicationDate":"2005-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2005.13","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25829945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discriminative discovery of transcription factor binding sites from location data.
Yuji Kawada, Yasubumi Sakakibara
Motivation: The availability of genome-wide location analyses based on chromatin immunoprecipitation (ChIP) data gives new insight for in silico analysis of transcriptional regulation.
Results: We propose a novel discriminative discovery framework for precisely identifying transcriptional regulatory motifs from both positive and negative samples (sets of upstream sequences of genes bound and not bound by a transcription factor (TF)) based on genome-wide location data. Our goal is to find discriminative motifs that best explain the location data, in the sense that the motifs precisely discriminate the positive samples from the negative ones. First, to discover an initial set of discriminative substrings between positive and negative samples, we apply a decision tree learning method that produces a text-classification tree, and we extract several clusters of similar substrings from the internal nodes of the learned tree. Second, we construct initial profile HMMs from each cluster to represent putative motifs and iteratively refine them to improve discrimination accuracy. Genome-wide experiments on yeast show that our method successfully identifies the consensus sequences reported in the literature for known TFs and discriminates well between positive and negative samples for all the TFs considered, whereas most other motif-detection methods perform poorly on this discrimination task. Our learned profile HMMs also improve on the false negative predictions of the ChIP data.
{"title":"Discriminative discovery of transcription factor binding sites from location data.","authors":"Yuji Kawada, Yasubumi Sakakibara","doi":"10.1109/csb.2005.30","DOIUrl":"https://doi.org/10.1109/csb.2005.30","url":null,"abstract":"<p><strong>Motivation: </strong>The availability of genome-wide location analyses based on chromatin immunoprecipitation (ChIP) data gives a new insight for in silico analysis of transcriptional regulations.</p><p><strong>Results: </strong>We propose a novel discriminative discovery framework for precisely identifying transcriptional regulatory motifs from both positive and negative samples (sets of upstream sequences of both bound and unbound genes by a transcription factor (TF)) based on the genome-wide location data. In this framework, our goal is to find such discriminative motifs that best explain the location data in the sense that the motifs precisely discriminate the positive samples from the negative ones. First, in order to discover an initial set of discriminative substrings between positive and negative samples, we apply a decision tree learning method which produces a text-classification tree. We extract several clusters consisting of similar substrings from the internal nodes of the learned tree. Second, we start with initial profile-HMMs constructed from each cluster for representing putative motifs and iteratively refine the profile-HMMs to improve the discrimination accuracies. Our genome-wide experimental results on yeast show that our method successfully identifies the consensus sequences for known TFs in the literature and further presents significant performances for discriminating between positive and negative samples in all the TFs, while most other motif detecting methods show very poor performances on the problem of discriminations. Our learned profile-HMMs also improve false negative predictions of ChIP data.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"86-9"},"PeriodicalIF":0.0,"publicationDate":"2005-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2005.30","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25829993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient algorithms and software for detection of full-length LTR retrotransposons.
Anantharaman Kalyanaraman, Srinivas Aluru
LTR retrotransposons constitute one of the most abundant classes of repetitive elements in eukaryotic genomes. In this paper, we present a new algorithm for detecting full-length LTR retrotransposons in genomic sequences. The algorithm identifies regions in a genomic sequence that show the structural characteristics of LTR retrotransposons. Three key components distinguish our algorithm from current software: (i) a novel method that preprocesses the entire genomic sequence in linear time and produces high-quality pairs of LTR candidates in constant running time per pair, (ii) a thorough alignment-based evaluation of candidate pairs to ensure high-quality prediction, and (iii) a robust parameter set encompassing both structural constraints and quality controls, providing users with a high degree of flexibility. Validation of both our serial and parallel implementations against the yeast genome indicates superior quality and performance compared to existing software.
{"title":"Efficient algorithms and software for detection of full-length LTR retrotransposons.","authors":"Anantharaman Kalyanaraman, Srinivas Aluru","doi":"10.1109/csb.2005.31","DOIUrl":"https://doi.org/10.1109/csb.2005.31","url":null,"abstract":"<p><p>LTR retrotransposons constitute one of the most abundant classes of repetitive elements in eukaryotic genomes. In this paper, we present a new algorithm for detection of full-length LTR retrotransposons in genomic sequences. The algorithm identifies regions in a genomic sequence that show structural characteristics of LTR retrotransposons. Three key components distinguish our algorithm from that of current software - (i) a novel method that preprocesses the entire genomic sequence in linear time and produces high quality pairs of LTR candidates in running time that is constant per pair, (ii) a thorough alignment-based evaluation of candidate pairs to ensure high quality prediction, and (iii) a robust parameter set encompassing both structural constraints and quality controls providing users with a high degree of flexibility. Validation of both our serial and parallel implementations of the algorithm against the yeast genome indicates both superior quality and performance results when compared to existing software.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"56-64"},"PeriodicalIF":0.0,"publicationDate":"2005-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2005.31","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25830572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}