Computational systems bioinformatics. Computational Systems Bioinformatics Conference: latest articles


Fast multisegment alignments for temporal expression profiles.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2008-01-01 DOI: 10.1142/9781848162648_0028

A. Smith, M. Craven

We present two heuristics for speeding up a time-series alignment algorithm that is related to dynamic time warping (DTW). In previous work, we developed our multisegment alignment algorithm to answer similarity queries for toxicogenomic time-series data. Our multisegment algorithm returns more accurate alignments than DTW at the cost of time complexity; the multisegment algorithm is O(n^5) whereas DTW is O(n^2). The first heuristic we present speeds up our algorithm by a constant factor by restricting alignments to a cone shape in alignment space. The second heuristic restricts the alignments considered to those near one returned by a DTW-like method. This heuristic reduces the time complexity to O(n^3). Importantly, neither heuristic results in a loss in accuracy.
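For readers unfamiliar with the baseline, the following is a minimal Python sketch of classic DTW with an optional band constraint. It is not the authors' multisegment algorithm, nor their cone or DTW-neighbourhood heuristic; the function name `dtw_distance` and the `band` parameter are illustrative assumptions. It only shows how restricting the region of alignment space that gets filled in reduces the work done.

```python
import math


def dtw_distance(a, b, band=None):
    """Classic DTW cost between two numeric sequences.

    With band=None the full (n x m) table is filled. With an integer band,
    only cells with |i - j| <= band are filled (a simple band restriction,
    used here as a stand-in for restricting the alignment space).
    """
    n, m = len(a), len(b)
    if band is not None:
        # The band must be wide enough for a path to reach cell (n, m).
        band = max(band, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        lo = 1 if band is None else max(1, i - band)
        hi = m if band is None else min(m, i + band)
        for j in range(lo, hi + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]
```

With `band=None` the whole quadratic table is computed; a small `band` fills only cells near the diagonal, so the restricted cost can never be lower than the unrestricted one.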


Citations: 8

An ORFome assembly approach to metagenomics sequences analysis.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2008-01-01

Yuzhen Ye, Haixu Tang

Metagenomics is an emerging methodology for the direct genomic analysis of a mixed community of uncultured microorganisms. Current analyses of metagenomics data largely rely on computational tools originally designed for microbial genomics projects. The challenge of assembling metagenomic sequences arises mainly from the short reads and the high species complexity of the community. Alternatively, individual (short) reads are searched directly against databases of known genes (or proteins) to identify homologous sequences. The latter approach may have low sensitivity and specificity in identifying homologous sequences, which may further bias the subsequent diversity analysis. In this paper, we present a novel approach to metagenomic data analysis, called Metagenomic ORFome Assembly (MetaORFA). The computational framework consists of three steps. Each read from a metagenomics project is first annotated with putative open reading frames (ORFs) that likely encode proteins. Next, the predicted ORFs are assembled into a collection of peptides using an EULER assembly method. Finally, the assembled peptides (i.e., the ORFome) are used for database searching of homologs and subsequent diversity analysis. We applied the MetaORFA approach to several metagenomics datasets with low-coverage short reads. The results show that MetaORFA can produce long peptides even when the sequence coverage of reads is extremely low. Hence, the ORFome assembly significantly increased the sensitivity of homology searching, and may potentially improve the diversity analysis of the metagenomic data. This improvement is especially useful for metagenomic projects in which genome assembly does not work because of low sequence coverage.
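The first of the three steps, annotating each read with putative ORFs, can be illustrated with a simple six-frame translation. This is only a sketch under stated assumptions: the abstract does not say which ORF predictor MetaORFA uses, so the helper names (`putative_orfs`, `translate`, `reverse_complement`) and the `min_len` threshold are hypothetical, and the EULER-based peptide assembly step is not shown.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def putative_orfs(read, min_len=10):
    """Stop-free peptide stretches of at least min_len residues, all six frames."""
    peptides = []
    for strand in (read, reverse_complement(read)):
        for frame in range(3):
            # '*' marks a stop codon, so splitting yields stop-free stretches.
            for piece in translate(strand[frame:]).split("*"):
                if len(piece) >= min_len:
                    peptides.append(piece)
    return peptides
```

Because short reads rarely contain both a start and a stop codon, this sketch keeps any sufficiently long stop-free stretch rather than requiring a complete ORF.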


Volume : 7 Pages : 3-13

Citations: 0

A probabilistic coding based quantum genetic algorithm for multiple sequence alignment.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2008-01-01

Hongwei Huo, Qiaoluan Xie, Xubang Shen, Vojislav Stojkovic

This paper presents an original Quantum Genetic algorithm for Multiple sequence ALIGNment (QGMALIGN) that combines a genetic algorithm and a quantum algorithm. A quantum probabilistic coding is designed for representing the multiple sequence alignment. A quantum rotation gate is used as a mutation operator to guide the quantum state evolution. Six genetic operators are designed on the basis of this coding to improve the solution during the evolutionary process. The features of implicit parallelism and state superposition in quantum mechanics and the global search capability of the genetic algorithm are exploited to achieve efficient computation. A set of well-known test cases from BAliBASE2.0 is used as a reference to evaluate the efficiency of the QGMALIGN optimization. The QGMALIGN results have been compared with the results of the most popular methods (CLUSTALX, SAGA, DIALIGN, SB_PIMA, and QGMALIGN). The results show that QGMALIGN performs well on the presented biological data. The addition of genetic operators to the quantum algorithm lowers the overall running time.


Citations: 0

A Hausdorff-based NOE assignment algorithm using protein backbone determined from residual dipolar couplings and rotamer patterns.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2008-01-01

Jianyang Zeng, Chittaranjan Tripathy, Pei Zhou, Bruce R Donald

High-throughput structure determination based on solution Nuclear Magnetic Resonance (NMR) spectroscopy plays an important role in structural genomics. One of the main bottlenecks in NMR structure determination is the interpretation of NMR data to obtain a sufficient number of accurate distance restraints by assigning nuclear Overhauser effect (NOE) spectral peaks to pairs of protons. The difficulty in automated NOE assignment mainly lies in the ambiguities arising both from the resonance degeneracy of chemical shifts and from the uncertainty due to experimental errors in NOE peak positions. In this paper we present a novel NOE assignment algorithm, called HAusdorff-based NOE Assignment (HANA), that starts with a high-resolution protein backbone computed using only two residual dipolar couplings (RDCs) per residue, employs a Hausdorff-based pattern matching technique to deduce similarity between experimental and back-computed NOE spectra for each rotamer from a statistically diverse library, and drives the selection of optimal position-specific rotamers for filtering ambiguous NOE assignments. Our algorithm runs in time O(tn^3 + tn log t), where t is the maximum number of rotamers per residue and n is the size of the protein. Application of our algorithm to biological NMR data for three proteins, namely, human ubiquitin, the zinc finger domain of the human DNA Y-polymerase Eta (pol eta) and the human Set2-Rpb1 interacting domain (hSRI), demonstrates that our algorithm overcomes spectral noise to achieve more than 90% assignment accuracy. Additionally, the final structures calculated using our automated NOE assignments have backbone RMSD < 1.7 Å and all-heavy-atom RMSD < 2.5 Å from reference structures that were determined either by X-ray crystallography or traditional NMR approaches. These results show that our NOE assignment algorithm can be successfully applied to protein NMR spectra to obtain high-quality structures.
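As background for the pattern matching step, the plain (symmetric) Hausdorff distance between two finite point sets can be sketched as below. This is the textbook definition only, written in Python; the abstract does not specify which Hausdorff variant HANA uses, and treating NOE peaks as coordinate tuples is an assumption made for illustration.

```python
import math


def directed_hausdorff(A, B):
    """Largest distance from a point of A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


# Hypothetical peak positions: a back-computed pattern and an experimental one.
back_computed = [(0.0, 0.0), (1.0, 0.0)]
experimental = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
score = hausdorff(back_computed, experimental)
```

A small distance means every peak in one pattern has a close counterpart in the other, which is the sense in which such a measure can rank candidate rotamers by how well their predicted peaks match the spectrum.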


Volume : 7 Pages : 169-81

Citations: 0

Efficient haplotype inference from pedigrees with missing data using linear systems with disjoint-set data structures.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2008-01-01

Xin Li, Jing Li

We study the haplotype inference problem from pedigree data under the zero recombination assumption, which is well supported by real data for tightly linked markers (i.e., single nucleotide polymorphisms (SNPs)) over a relatively large chromosome segment. We solve the problem in a rigorous mathematical manner by formulating genotype constraints as a linear system of inheritance variables. We then utilize disjoint-set structures to encode connectivity information among individuals, to detect constraints from genotypes, and to check consistency of constraints. On a tree pedigree without missing data, our algorithm can output a general solution as well as the total number of specific solutions in nearly linear time O(mn · α(n)), where m is the number of loci, n is the number of individuals and α is the inverse Ackermann function, which is a further improvement over existing algorithms. We also extend the idea to looped pedigrees and pedigrees with missing data by considering existing (partial) constraints on inheritance variables. The algorithm has been implemented in C++ and will be incorporated into our PedPhase package. Experimental results show that it can correctly identify all 0-recombinant solutions with great efficiency. Comparisons with two other popular algorithms show that the proposed algorithm achieves 10- to 10^5-fold improvements over a variety of parameter settings. The experimental study also provides empirical evidence for the complexity bounds suggested by theoretical analysis.
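To illustrate how a disjoint-set structure can check the consistency of linear constraints in near-linear time, here is a generic Python sketch of a union-find with parity. It assumes, purely for illustration, binary variables and pairwise constraints of the form x_a XOR x_b = c; the abstract does not give the exact form of the authors' constraints, and the class name `ParityDSU` is hypothetical (their implementation is in C++).

```python
class ParityDSU:
    """Union-find with path compression and union by rank, tracking parity."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n
        self.parity = [0] * n  # parity of each node relative to its parent

    def find(self, x):
        """Return (root, parity of x relative to root), compressing the path."""
        path = []
        while self.parent[x] != x:
            path.append(x)
            x = self.parent[x]
        root, acc = x, 0
        # Start at the node nearest the root and accumulate parity outward.
        for node in reversed(path):
            acc ^= self.parity[node]
            self.parity[node] = acc
            self.parent[node] = root
        return root, (self.parity[path[0]] if path else 0)

    def add_constraint(self, a, b, c):
        """Record x_a XOR x_b = c; return False if it contradicts earlier ones."""
        ra, pa = self.find(a)
        rb, pb = self.find(b)
        if ra == rb:
            return (pa ^ pb) == c
        if self.rank[ra] < self.rank[rb]:
            ra, rb, pa, pb = rb, ra, pb, pa
        self.parent[rb] = ra
        self.parity[rb] = pa ^ pb ^ c
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True
```

Each `find` or `add_constraint` call costs amortized O(α(n)) time, which is where an inverse Ackermann factor of the kind quoted in the abstract typically comes from.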


Volume : 7 Pages : 297-308 Open-access PDF : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3326667/pdf/nihms231595.pdf

Citations: 0

On the accurate construction of consensus genetic maps.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2008-01-01 DOI: 10.1142/9781848162648_0025

Yonghui Wu, T. Close, S. Lonardi

We study the problem of merging genetic maps, when the individual genetic maps are given as directed acyclic graphs. The problem is to build a consensus map, which includes and is consistent with all (or, the vast majority of) the markers in the individual maps. When markers in the input maps have ordering conflicts, the resulting consensus map will contain cycles. We formulate the problem of resolving cycles in a combinatorial optimization framework, which in turn is expressed as an integer linear program. A faster approximation algorithm is proposed, and an additional speed-up heuristic is developed. According to an extensive set of experimental results, our tool is consistently better than JOINMAP, both in terms of accuracy and running time.
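The way ordering conflicts turn into cycles can be seen in a small Python sketch. It makes simplifying assumptions that are not in the abstract: each input map is a plain linear order of marker names rather than a general DAG, and conflicts are only detected (via Kahn's topological sort), not resolved by the integer linear program or approximation algorithm the authors describe. The function names are hypothetical.

```python
from collections import deque


def merge_maps(maps):
    """Union the 'a precedes b' edges of several marker orders into one graph."""
    graph = {}
    for order in maps:
        for marker in order:
            graph.setdefault(marker, set())
        for a, b in zip(order, order[1:]):
            graph[a].add(b)
    return graph


def consensus_order(graph):
    """Kahn's algorithm: a topological order, or None if the graph has a cycle."""
    indegree = {node: 0 for node in graph}
    for successors in graph.values():
        for s in successors:
            indegree[s] += 1
    queue = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for s in sorted(graph[node]):
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    # Nodes on a cycle never reach indegree 0, so they are missing from order.
    return order if len(order) == len(graph) else None


consistent = consensus_order(merge_maps([["m1", "m2", "m3"], ["m2", "m3", "m4"]]))
conflicting = consensus_order(merge_maps([["m1", "m2", "m3"], ["m3", "m1"]]))
```

In the second call the maps disagree on whether `m1` precedes `m3`, so the merged graph contains a cycle and no consistent order exists until some edge is removed.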


Citations: 64

Proceedings of Computational Systems Bioinformatics 2008. August 26-29, 2008. Palo Alto, California, USA.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2008-01-01

Citations: 0

Improving homology models for protein-ligand binding sites.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2008-01-01

Chris Kauffman, Huzefa Rangwala, George Karypis

In order to improve the prediction of protein-ligand binding sites through homology modeling, we incorporate knowledge of the binding residues into the modeling framework. Residues are identified as binding or nonbinding based on their true labels as well as labels predicted from structure and sequence. The sequence predictions were made using a support vector machine framework which employs a sophisticated window-based kernel. Binding labels are used with a very sensitive sequence alignment method to align the target and template. Relevant parameters governing the alignment process are searched for optimal values. Based on our results, homology models of the binding site can be improved if a priori knowledge of the binding residues is available. For target-template pairs with low sequence identity and high structural diversity our sequence-based prediction method provided sufficient information to realize this improvement.


Citations: 0

Designing secondary structure profiles for fast ncRNA identification.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2008-01-01 DOI: 10.1142/9781848162648_0013

Yanni Sun, J. Buhler

Detecting non-coding RNAs (ncRNAs) in genomic DNA is an important part of annotation. However, the most widely used tool for modeling ncRNA families, the covariance model (CM), incurs a high computational cost when used for search. This cost can be reduced by using a filter to exclude sequence that is unlikely to contain the ncRNA of interest, applying the CM only where it is likely to match strongly. Despite recent advances, designing an efficient filter that can detect nearly all ncRNA instances while excluding most irrelevant sequences remains challenging. This work proposes a systematic procedure to convert a CM for an ncRNA family to a secondary structure profile (SSP), which augments a conservation profile with secondary structure information but can still be efficiently scanned against long sequences. We use dynamic programming to estimate an SSP's sensitivity and FP rate, yielding an efficient, fully automated filter design algorithm. Our experiments demonstrate that designed SSP filters can achieve significant speedup over unfiltered CM search while maintaining high sensitivity for various ncRNA families, including those with and without strong sequence conservation. For highly structured ncRNA families, including secondary structure conservation yields better performance than using primary sequence conservation alone.


Volume : 7 Pages : 145-56

Citations: 5

Combining sequence and structural profiles for protein solvent accessibility prediction.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2008-01-01 DOI: 10.1142/9781848162648_0017

R. Bondugula, Dong Xu

Solvent accessibility is an important structural feature for a protein. We propose a new method for solvent accessibility prediction that uses known structure and sequence information more efficiently. We first estimate the relative solvent accessibility of the query protein using a fuzzy mean operator from the solvent accessibilities of known structure fragments that have similar sequences to the query protein. We then integrate the estimated solvent accessibility and the position-specific scoring matrix of the query protein using a neural network. We tested our method on a large data set consisting of 3386 non-redundant proteins. The comparison with other methods shows slightly improved prediction accuracies with our method. The resulting system does not need to be retrained when new data is available. We incorporated our method into the MUPRED system, which is available as a web server at http://digbio.missouri.edu/mupred.


Citations: 8
