
Computational systems bioinformatics. Computational Systems Bioinformatics Conference: latest articles

Voting algorithms for the motif finding problem.
Xiaowen Liu, Bin Ma, Lusheng Wang
Finding motifs in many sequences is an important problem in computational biology, especially in the identification of regulatory motifs in DNA sequences. Let c be a motif sequence. Given a set of sequences, each planted with a mutated version of c at an unknown position, the motif finding problem is to find these planted motifs and the original c. In this paper, we study the VM model of the planted motif problem, which was proposed by Pevzner and Sze. We give a simple Selecting One Voting algorithm and a more powerful Selecting k Voting algorithm. When the length of the motif and the number of input sequences are large enough, we prove that the two algorithms can find the unknown motif consensus with high probability. In the proof, we show why a large number of input sequences is so important for finding motifs, as most researchers believe. Experimental results on simulated data also support the claim. The Selecting k Voting algorithm is powerful but computationally intensive. To speed up the algorithm, we propose a progressive filtering algorithm, which improves the running time significantly and has good accuracy in finding motifs. Our experimental results show that the Selecting k Voting algorithm with progressive filtering performs very well in practice and outperforms some of the best-known algorithms. Availability: The software is available upon request.
Computational Systems Bioinformatics Conference, vol. 7, pp. 37-47, 2008. DOI: https://doi.org/10.1142/9781848162648_0004
Citations: 0
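Illustrative note on the Liu, Ma, and Wang entry above: the abstract does not spell out the Selecting One Voting procedure, so the following is only a minimal sketch of one plausible seed-and-vote scheme, not the authors' algorithm. It assumes that each l-mer of the first sequence is tried as a seed, that every other sequence contributes its closest l-mer, and that the consensus is the column-wise majority vote. The function names are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def closest_lmer(seq, pattern):
    """The substring of seq (same length as pattern) nearest to pattern."""
    l = len(pattern)
    windows = (seq[i:i + l] for i in range(len(seq) - l + 1))
    return min(windows, key=lambda w: hamming(w, pattern))


def vote_consensus(seqs, l):
    """Assumed scheme: seed from the first sequence, vote column by column."""
    best, best_cost = None, None
    first = seqs[0]
    for i in range(len(first) - l + 1):
        seed = first[i:i + l]
        # Every other sequence "votes" with its l-mer closest to the seed.
        picks = [seed] + [closest_lmer(s, seed) for s in seqs[1:]]
        consensus = "".join(
            Counter(col).most_common(1)[0][0] for col in zip(*picks)
        )
        # Score the candidate by its total distance to all sequences.
        cost = sum(hamming(closest_lmer(s, consensus), consensus) for s in seqs)
        if best_cost is None or cost < best_cost:
            best, best_cost = consensus, cost
    return best, best_cost
```

With an unmutated motif planted in every sequence, the sketch recovers it with total distance 0; real planted-motif instances carry mutations, which is where the number of input sequences starts to matter.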
An ORFome assembly approach to metagenomics sequences analysis.
Yuzhen Ye, Haixu Tang

Metagenomics is an emerging methodology for the direct genomic analysis of a mixed community of uncultured microorganisms. The current analyses of metagenomics data largely rely on the computational tools originally designed for microbial genomics projects. The challenge of assembling metagenomic sequences arises mainly from the short reads and the high species complexity of the community. Alternatively, individual (short) reads can be searched directly against databases of known genes (or proteins) to identify homologous sequences. The latter approach may have low sensitivity and specificity in identifying homologous sequences, which may further bias the subsequent diversity analysis. In this paper, we present a novel approach to metagenomic data analysis, called Metagenomic ORFome Assembly (MetaORFA). The whole computational framework consists of three steps. Each read from a metagenomics project is first annotated with putative open reading frames (ORFs) that likely encode proteins. Next, the predicted ORFs are assembled into a collection of peptides using an EULER assembly method. Finally, the assembled peptides (i.e., the ORFome) are used for database searching of homologs and subsequent diversity analysis. We applied the MetaORFA approach to several metagenomics datasets with low-coverage short reads. The results show that MetaORFA can produce long peptides even when the sequence coverage of reads is extremely low. Hence, the ORFome assembly significantly increased the sensitivity of homology searching and may potentially improve the diversity analysis of the metagenomic data. This improvement is especially useful for metagenomic projects in which genome assembly does not work because of low sequence coverage.

Computational Systems Bioinformatics Conference, vol. 7, pp. 3-13, 2008.
Citations: 0
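Illustrative note on the Ye and Tang entry above: the first MetaORFA step annotates each read with putative ORFs. The sketch below shows one simple way this could look, assuming a six-frame translation with the standard genetic code and treating every stop-free stretch as a putative ORF fragment (short reads rarely contain a full start-to-stop ORF). The function names and the `min_len` cutoff are hypothetical and are not taken from the paper.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def revcomp(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def putative_orfs(read, min_len=5):
    """Stop-free peptide fragments of at least min_len residues, six frames."""
    found = []
    for strand in (read, revcomp(read)):
        for frame in range(3):
            for piece in translate(strand[frame:]).split("*"):
                if len(piece) >= min_len:
                    found.append(piece)
    return found
```

The peptide fragments returned here would be the input to the assembly step; the EULER-based assembly itself is not sketched.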
A probabilistic coding based quantum genetic algorithm for multiple sequence alignment.
Hongwei Huo, Qiaoluan Xie, Xubang Shen, Vojislav Stojkovic

This paper presents an original Quantum Genetic algorithm for Multiple sequence ALIGNment (QGMALIGN) that combines a genetic algorithm and a quantum algorithm. A quantum probabilistic coding is designed for representing the multiple sequence alignment. A quantum rotation gate is used as a mutation operator to guide the quantum state evolution. Six genetic operators are designed on the coding basis to improve the solution during the evolutionary process. The features of implicit parallelism and state superposition in quantum mechanics and the global search capability of the genetic algorithm are exploited to achieve efficient computation. A set of well-known test cases from BAliBASE2.0 is used as a reference to evaluate the efficiency of the QGMALIGN optimization. The QGMALIGN results have been compared with those of the most popular methods (CLUSTALX, SAGA, DIALIGN, SB_PIMA, and QGMALIGN). The results show that QGMALIGN performs well on the presented biological data. The addition of genetic operators to the quantum algorithm lowers the overall running time.

Computational Systems Bioinformatics Conference, vol. 7, pp. 15-26, 2008.
Citations: 0
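Illustrative note on the Huo et al. entry above: a quantum rotation gate in quantum-inspired evolutionary algorithms is usually a 2x2 rotation applied to a pair of amplitudes (alpha, beta) with alpha^2 + beta^2 = 1. The sketch below shows that generic update, assuming a fixed step `delta` and a simple rule that rotates each qubit toward the bit of the best solution found so far. The names and the step size are assumptions, and the actual QGMALIGN coding and operators are not reproduced.

```python
import math
import random


def rotate(qubit, theta):
    """Apply the rotation gate [[cos, -sin], [sin, cos]] to (alpha, beta)."""
    a, b = qubit
    return (math.cos(theta) * a - math.sin(theta) * b,
            math.sin(theta) * a + math.cos(theta) * b)


def observe(qubits, rng=random):
    """Collapse each qubit to a bit: 1 with probability beta squared."""
    return [1 if rng.random() < b * b else 0 for _, b in qubits]


def step_towards(qubits, best_bits, delta=0.05 * math.pi):
    """Rotate every qubit by delta toward the corresponding best bit."""
    updated = []
    for (a, b), bit in zip(qubits, best_bits):
        # The rotation direction that raises |beta| depends on the quadrant.
        sign = 1.0 if a * b >= 0 else -1.0
        theta = delta * sign if bit == 1 else -delta * sign
        updated.append(rotate((a, b), theta))
    return updated
```

Because the gate is a rotation, the amplitudes stay normalized, so the squared amplitudes can still be read as probabilities after every update.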
A Hausdorff-based NOE assignment algorithm using protein backbone determined from residual dipolar couplings and rotamer patterns.
Jianyang Zeng, Chittaranjan Tripathy, Pei Zhou, Bruce R Donald

High-throughput structure determination based on solution Nuclear Magnetic Resonance (NMR) spectroscopy plays an important role in structural genomics. One of the main bottlenecks in NMR structure determination is the interpretation of NMR data to obtain a sufficient number of accurate distance restraints by assigning nuclear Overhauser effect (NOE) spectral peaks to pairs of protons. The difficulty in automated NOE assignment mainly lies in the ambiguities arising both from the resonance degeneracy of chemical shifts and from the uncertainty due to experimental errors in NOE peak positions. In this paper we present a novel NOE assignment algorithm, called HAusdorff-based NOE Assignment (HANA), that starts with a high-resolution protein backbone computed using only two residual dipolar couplings (RDCs) per residue, employs a Hausdorff-based pattern matching technique to deduce similarity between experimental and back-computed NOE spectra for each rotamer from a statistically diverse library, and drives the selection of optimal position-specific rotamers for filtering ambiguous NOE assignments. Our algorithm runs in time O(tn^3 + tn log t), where t is the maximum number of rotamers per residue and n is the size of the protein. Application of our algorithm to biological NMR data for three proteins, namely, human ubiquitin, the zinc finger domain of the human DNA Y-polymerase Eta (pol eta), and the human Set2-Rpb1 interacting domain (hSRI), demonstrates that our algorithm overcomes spectral noise to achieve more than 90% assignment accuracy. Additionally, the final structures calculated using our automated NOE assignments have backbone RMSD < 1.7 Å and all-heavy-atom RMSD < 2.5 Å from reference structures that were determined either by X-ray crystallography or traditional NMR approaches. These results show that our NOE assignment algorithm can be successfully applied to protein NMR spectra to obtain high-quality structures.

Computational Systems Bioinformatics Conference, vol. 7, pp. 169-181, 2008.
Citations: 0
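Illustrative note on the Zeng et al. entry above: the abstract compares experimental and back-computed NOE peak patterns with a Hausdorff-based measure. The sketch below shows only the textbook Hausdorff distance between two point sets, assuming peaks are given as coordinate tuples; the measure actually used in HANA may be a modified variant, and the function names are hypothetical.

```python
import math


def directed_hausdorff(A, B):
    """Largest distance from a point of A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```

A small distance means every peak in one pattern has a close counterpart in the other, which is the sense in which a back-computed pattern for a rotamer can be said to match the experimental spectrum. Note that the plain form is sensitive to a single outlying peak.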
On the accurate construction of consensus genetic maps.
Yonghui Wu, T. Close, S. Lonardi
We study the problem of merging genetic maps, when the individual genetic maps are given as directed acyclic graphs. The problem is to build a consensus map, which includes and is consistent with all (or, the vast majority of) the markers in the individual maps. When markers in the input maps have ordering conflicts, the resulting consensus map will contain cycles. We formulate the problem of resolving cycles in a combinatorial optimization framework, which in turn is expressed as an integer linear program. A faster approximation algorithm is proposed, and an additional speed-up heuristic is developed. According to an extensive set of experimental results, our tool is consistently better than JOINMAP, both in terms of accuracy and running time.
Computational Systems Bioinformatics Conference, vol. 7, pp. 285-296, 2008. DOI: https://doi.org/10.1142/9781848162648_0025
Citations: 64
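Illustrative note on the Wu, Close, and Lonardi entry above: merging maps with conflicting marker orders produces cycles. The sketch below covers only the merge-and-detect part, under the simplifying assumption that each input map is a linear marker order rather than a general directed acyclic graph. It does not implement the integer linear program or the approximation algorithm from the paper, and all names are hypothetical.

```python
from collections import defaultdict, deque


def merge_maps(maps):
    """Union the 'comes before' edges of several marker orders."""
    succ = defaultdict(set)
    nodes = set()
    for order in maps:
        nodes.update(order)
        for a, b in zip(order, order[1:]):
            succ[a].add(b)
    return nodes, succ


def topological_order(nodes, succ):
    """Kahn's algorithm; returns None when the merged graph has a cycle."""
    indeg = {n: 0 for n in nodes}
    for a in succ:
        for b in succ[a]:
            indeg[b] += 1
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in sorted(succ[n]):
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    # Markers left unprocessed are on, or downstream of, a cycle.
    return order if len(order) == len(nodes) else None
```

When `topological_order` returns `None`, some edges have to be removed before a consensus order exists; choosing which ones is the optimization problem the abstract describes.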
Efficient haplotype inference from pedigrees with missing data using linear systems with disjoint-set data structures.
Xin Li, Jing Li

We study the haplotype inference problem from pedigree data under the zero recombination assumption, which is well supported by real data for tightly linked markers (i.e., single nucleotide polymorphisms (SNPs)) over a relatively large chromosome segment. We solve the problem in a rigorous mathematical manner by formulating genotype constraints as a linear system of inheritance variables. We then utilize disjoint-set structures to encode connectivity information among individuals, to detect constraints from genotypes, and to check consistency of constraints. On a tree pedigree without missing data, our algorithm can output a general solution as well as the total number of specific solutions in nearly linear time O(mn·α(n)), where m is the number of loci, n is the number of individuals, and α is the inverse Ackermann function, which is a further improvement over existing algorithms. We also extend the idea to looped pedigrees and pedigrees with missing data by considering existing (partial) constraints on inheritance variables. The algorithm has been implemented in C++ and will be incorporated into our PedPhase package. Experimental results show that it can correctly identify all 0-recombinant solutions with great efficiency. Comparisons with two other popular algorithms show that the proposed algorithm achieves 10- to 10^5-fold improvements over a variety of parameter settings. The experimental study also provides empirical evidence for the complexity bounds suggested by theoretical analysis.

Computational Systems Bioinformatics Conference, vol. 7, pp. 297-308, 2008. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3326667/pdf/nihms231595.pdf
Citations: 0
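Illustrative note on the Li and Li entry above: the abstract describes checking the consistency of linear constraints with disjoint-set structures. The sketch below assumes, purely for illustration, that every constraint has the form x XOR y = c over binary variables, and uses a union-find structure with parity bits to detect contradictions. The class and method names are hypothetical, and this is not the PedPhase implementation.

```python
class ParityDSU:
    """Union-find that tracks the parity of each node relative to its root."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n
        self.parity = [0] * n  # parity of a node relative to its parent

    def find(self, x):
        """Return (root, parity of x relative to root), compressing the path."""
        if self.parent[x] == x:
            return x, 0
        root, p = self.find(self.parent[x])
        self.parity[x] ^= p
        self.parent[x] = root
        return root, self.parity[x]

    def add_constraint(self, x, y, c):
        """Record x XOR y = c; return False if it contradicts earlier ones."""
        rx, px = self.find(x)
        ry, py = self.find(y)
        if rx == ry:
            return (px ^ py) == c
        if self.rank[rx] < self.rank[ry]:
            rx, ry, px, py = ry, rx, py, px
        self.parent[ry] = rx
        self.parity[ry] = px ^ py ^ c
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True
```

With union by rank and path compression, each operation costs amortized inverse-Ackermann time, which is where a factor such as α(n) in the abstract's bound typically comes from.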
Proceedings of Computational Systems Bioinformatics 2008. August 26-29, 2008. Palo Alto, California, USA.
Computational Systems Bioinformatics Conference, vol. 7, pp. 3-340, 2008.
Citations: 0
Improving homology models for protein-ligand binding sites.
Chris Kauffman, Huzefa Rangwala, George Karypis

In order to improve the prediction of protein-ligand binding sites through homology modeling, we incorporate knowledge of the binding residues into the modeling framework. Residues are identified as binding or nonbinding based on their true labels as well as labels predicted from structure and sequence. The sequence predictions were made using a support vector machine framework which employs a sophisticated window-based kernel. Binding labels are used with a very sensitive sequence alignment method to align the target and template. Relevant parameters governing the alignment process are searched for optimal values. Based on our results, homology models of the binding site can be improved if a priori knowledge of the binding residues is available. For target-template pairs with low sequence identity and high structural diversity our sequence-based prediction method provided sufficient information to realize this improvement.

Computational Systems Bioinformatics Conference, vol. 7, pp. 211-222, 2008.
Citations: 0
Designing secondary structure profiles for fast ncRNA identification.
Yanni Sun, J. Buhler
Detecting non-coding RNAs (ncRNAs) in genomic DNA is an important part of annotation. However, the most widely used tool for modeling ncRNA families, the covariance model (CM), incurs a high computational cost when used for search. This cost can be reduced by using a filter to exclude sequence that is unlikely to contain the ncRNA of interest, applying the CM only where it is likely to match strongly. Despite recent advances, designing an efficient filter that can detect nearly all ncRNA instances while excluding most irrelevant sequences remains challenging. This work proposes a systematic procedure to convert a CM for an ncRNA family to a secondary structure profile (SSP), which augments a conservation profile with secondary structure information but can still be efficiently scanned against long sequences. We use dynamic programming to estimate an SSP's sensitivity and false-positive (FP) rate, yielding an efficient, fully automated filter design algorithm. Our experiments demonstrate that designed SSP filters can achieve significant speedup over unfiltered CM search while maintaining high sensitivity for various ncRNA families, including those with and without strong sequence conservation. For highly structured ncRNA families, including secondary structure conservation yields better performance than using primary sequence conservation alone.
Journal: Computational systems bioinformatics. Computational Systems Bioinformatics Conference; Volume: 7 1; Pages: 145-56; Published: 2008-01-01; Type: Journal Article; DOI: https://doi.org/10.1142/9781848162648_0013
Citations: 5
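The filtering idea in the ncRNA abstract above (score every window cheaply, then run the expensive model only on windows that pass) can be illustrated with a minimal sketch. This is not the authors' SSP design: the toy profile below uses sequence conservation only, and `make_profile`, `filter_windows`, `search`, the +1/-1 scores, and the `expensive_model` callback are all hypothetical names and values chosen for illustration.

```python
def make_profile(consensus, match=1.0, mismatch=-1.0):
    """Build a toy per-column score table that rewards the consensus base.

    Assumption: a real filter would use log-odds scores derived from the
    family's covariance model, not fixed +1/-1 values.
    """
    return [{b: (match if b == c else mismatch) for b in "ACGU"} for c in consensus]


def filter_windows(sequence, profile, threshold):
    """Return start offsets of windows whose profile score reaches the threshold."""
    width = len(profile)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Unknown symbols (e.g. N) are scored like a mismatch.
        score = sum(col.get(base, -1.0) for col, base in zip(profile, window))
        if score >= threshold:
            hits.append(start)
    return hits


def search(sequence, profile, threshold, expensive_model):
    """Apply the expensive model only to windows that pass the cheap filter."""
    width = len(profile)
    return [(start, expensive_model(sequence[start:start + width]))
            for start in filter_windows(sequence, profile, threshold)]
```

Lowering `threshold` passes more windows to the expensive model (higher sensitivity, less speedup); raising it does the opposite, which is the trade-off the abstract says the filter design has to balance.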
MSDash: mass spectrometry database and search.
Zhan Wu, G. Lajoie, B. Ma
Along with the wide application of mass spectrometry in proteomics, more and more mass spectrometry data are becoming publicly available. Several public mass spectrometry data repositories have been built on the Internet. However, most of these repositories lack effective search methods. In this paper we describe a new mass spectrometry data library and a novel method to efficiently index and search the library for spectra that are similar to a query spectrum. A public online server has been set up and demonstrates the outstanding speed and scalability of our methods. Together with the mass spectrometry library, our search method can improve protein identification confidence by comparing a spectrum with those already characterized in the database. The search method can also be used alone to cluster similar spectra in a mass spectrometry dataset, in order to improve the speed and accuracy of protein identification or quantification.
Journal: Computational systems bioinformatics. Computational Systems Bioinformatics Conference; Volume: 7 1; Pages: 63-71; Published: 2008-01-01; Type: Journal Article; DOI: https://doi.org/10.1142/9781848162648_0006
Citations: 4
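The MSDash abstract above does not say how spectra are compared or indexed, so the following is only a sketch of what searching a library for spectra similar to a query can look like. It assumes spectra are lists of (m/z, intensity) peaks, bins them at a fixed width, and ranks library entries by cosine similarity with a brute-force scan. The function names, the 1.0 bin width, and the similarity measure are assumptions for illustration, not the method described in the paper.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.

    peaks is a list of (mz, intensity) pairs; the result maps bin index to
    total intensity.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def cosine_similarity(a, b):
    """Cosine similarity of two sparse binned spectra (0.0 if either is empty)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_peaks, library, top_n=1):
    """Rank library entries (name -> peaks) by similarity to the query, best first."""
    query = bin_spectrum(query_peaks)
    scored = [(cosine_similarity(query, bin_spectrum(peaks)), name)
              for name, peaks in library.items()]
    return sorted(scored, reverse=True)[:top_n]
```

A linear scan like this compares the query against every library spectrum, so its cost grows with library size; the efficient indexing the abstract refers to would replace that scan.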