Proceedings. IEEE Computational Systems Bioinformatics Conference: Latest Articles


Analysis of a systematic search-based algorithm for determining protein backbone structure from a minimum number of residual dipolar couplings.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01 DOI: 10.1109/csb.2004.1332445

Lincong Wang, Bruce Randall Donald

We have developed an ab initio algorithm for determining a protein backbone structure using global orientational restraints on internuclear vectors derived from residual dipolar couplings (RDCs) measured in one or two different aligning media by solution nuclear magnetic resonance (NMR) spectroscopy [14, 15]. Specifically, the conformation and global orientations of individual secondary structure elements are computed, independently, by an exact solution, systematic search-based minimization algorithm using only 2 RDCs per residue. The systematic search is built upon a quartic equation for computing, exactly and in constant time, the directions of an internuclear vector from RDCs, and linear or quadratic equations for computing the sines and cosines of backbone dihedral (phi, psi) angles from two vectors in consecutive peptide planes. In contrast to heuristic search such as simulated annealing (SA) or Monte-Carlo (MC) used by other NMR structure determination algorithms, our minimization algorithm can be analyzed rigorously in terms of expected algorithmic complexity and the coordinate precision of the protein structure as a function of error in the input data. The algorithm has been successfully applied to compute the backbone structures of three proteins using real NMR data.
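The orientational restraint that an RDC places on an internuclear vector can be illustrated with a short sketch. It assumes the commonly used principal-order-frame form of the RDC equation, r = D_max (S_xx x^2 + S_yy y^2 + S_zz z^2), with a traceless Saupe tensor and a unit vector (x, y, z). The function names, the default `d_max`, and the numeric values are illustrative assumptions. The sketch only back-computes RDCs and a fit residual; it does not reproduce the exact quartic solver described in the abstract.

```python
import math

def rdc(v, s_xx, s_yy, d_max=1.0):
    """Back-compute an RDC for unit vector v = (x, y, z) expressed in the
    principal order frame (assumed form; d_max is an illustrative constant)."""
    x, y, z = v
    s_zz = -(s_xx + s_yy)  # the Saupe tensor is traceless
    return d_max * (s_xx * x * x + s_yy * y * y + s_zz * z * z)

def rdc_rmsd(measured, vectors, s_xx, s_yy, d_max=1.0):
    """Root-mean-square deviation between measured and back-computed RDCs."""
    squared = [(r - rdc(v, s_xx, s_yy, d_max)) ** 2
               for r, v in zip(measured, vectors)]
    return math.sqrt(sum(squared) / len(squared))
```

Because only squared components enter the equation, a vector and its sign-flipped images give the same RDC, which is one reason a search over discrete candidate orientations is needed at all.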


Pages: 319-30

Citations: 0

Multi-knockout genetic network analysis: the Rad6 example.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01

Alon Kaufman, Martin Kupiec, Eytan Ruppin

A novel and rigorous Multi-perturbation Shapley Value Analysis (MSA) method has been recently presented [12]. The method addresses the challenge of defining and calculating the functional causal contributions of elements of a biological system. This paper presents the first study applying MSA to the analysis of gene knockout data. The MSA identifies the importance of genes in the Rad6 DNA repair pathway of the yeast S. cerevisiae, quantifying their contributions and characterizing their functional interactions. Incorporating additional biological knowledge, a new functional description of the Rad6 pathway is provided, predicting the existence of additional DNA polymerase and RFC-like complexes. The MSA is the first method for rigorously analyzing multi-knockout experiments, which are likely to soon become a standard and necessary tool for analyzing complex biological systems.
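The idea of assigning each gene a contribution from multi-knockout data can be sketched with the textbook Shapley value. This is a minimal illustration, not the MSA implementation: it enumerates every ordering of the elements, so it is only practical for a handful of genes, and the `performance` callback (system performance when exactly the given genes are intact) and the gene names are hypothetical.

```python
from itertools import permutations

def shapley_values(elements, performance):
    """Exact Shapley value of each element.

    performance(intact) takes a frozenset of intact (non-knocked-out)
    elements and returns the measured performance of the system.
    """
    totals = {e: 0.0 for e in elements}
    orderings = list(permutations(elements))
    for order in orderings:
        intact = frozenset()
        for e in order:
            with_e = intact | {e}
            # marginal contribution of e when added to the current coalition
            totals[e] += performance(with_e) - performance(intact)
            intact = with_e
    return {e: total / len(orderings) for e, total in totals.items()}

def toy_performance(intact):
    """Hypothetical system that works only if geneA and geneB are both intact."""
    return 1.0 if {"geneA", "geneB"} <= intact else 0.0
```

In the toy system, `geneA` and `geneB` split the credit equally and `geneC` receives none, and the contributions sum to the performance of the fully intact system.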


Citations: 0

Estimating and improving protein interaction error rates.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01 DOI: 10.1109/csb.2004.1332435

Patrik D'haeseleer, George M Church

High throughput protein interaction data sets have proven to be notoriously noisy. Although it is possible to focus on interactions with higher reliability by using only those that are backed up by two or more lines of evidence, this approach invariably throws out the majority of available data. A more optimal use could be achieved by incorporating the probabilities associated with all available interactions into the analysis. We present a novel method for estimating error rates associated with specific protein interaction data sets, as well as with individual interactions given the data sets in which they appear. As a bonus, we also get an estimate for the total number of protein interactions in yeast. Certain types of false positive results can be identified and removed, resulting in a significant improvement in quality of the data set. For co-purification data sets, we show how we can reach a tradeoff between the "spoke" and "matrix" representation of interactions within co-purified groups of proteins to achieve an optimal false positive error rate.
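The "spoke" and "matrix" representations mentioned above can be made concrete with a small sketch. It assumes the usual definitions: the spoke model records only bait-prey pairs, while the matrix model records every pair within the co-purified group. The function names and protein identifiers are illustrative, and the sketch does not attempt the error-rate estimation itself.

```python
from itertools import combinations

def spoke_pairs(bait, preys):
    """Spoke model: the bait interacts with each co-purified prey."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}

def matrix_pairs(bait, preys):
    """Matrix model: every pair of proteins in the co-purified group interacts."""
    members = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(members, 2)}
```

For a bait with p distinct preys, the spoke model yields p pairs and the matrix model yields (p + 1) p / 2, so the matrix model recovers more true interactions at the cost of many more candidate false positives.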


Citations: 0

PoPS: a computational tool for modeling and predicting protease specificity.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01 DOI: 10.1109/csb.2004.1332450

Sarah E Boyd, Maria Garcia de la Banda, Robert N Pike, James C Whisstock, George B Rudy

Proteases play a fundamental role in the control of intra- and extracellular processes by binding and cleaving specific amino acid sequences. Identifying these targets is extremely challenging. Current computational attempts to predict cleavage sites are limited, representing these amino acid sequences as patterns or frequency matrices. Here we present PoPS, a publicly accessible bioinformatics tool (http://pops.csse.monash.edu.au/) which provides a novel method for building computational models of protease specificity that, while still being based on these amino acid sequences, can be built from any experimental data or expert knowledge available to the user. PoPS specificity models can be used to predict and rank likely cleavages within a single substrate, and within entire proteomes. Other factors, such as the secondary or tertiary structure of the substrate, can be used to screen unlikely sites. Furthermore, the tool also provides facilities to infer, compare and test models, and to store them in a publicly accessible database.
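Ranking likely cleavages within a substrate can be sketched as a sliding-window score. The model format below (one residue-to-score dictionary per subsite) and the toy values are assumptions made for illustration; they are not the PoPS model format or real specificity data.

```python
def rank_cleavage_sites(substrate, model, default=0.0):
    """Score every window of len(model) residues and return the windows
    sorted from most to least likely.

    model: list of dicts, one per subsite, mapping residue -> score.
    Returns a list of (score, start_index, window) tuples.
    """
    width = len(model)
    scored = []
    for start in range(len(substrate) - width + 1):
        window = substrate[start:start + width]
        score = sum(subsite.get(residue, default)
                    for subsite, residue in zip(model, window))
        scored.append((score, start, window))
    # highest score first; ties broken by position in the substrate
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored

# Toy four-subsite model with made-up scores.
TOY_MODEL = [{"D": 1.0}, {"E": 1.0}, {"V": 1.0}, {"D": 2.0}]
```

A structural filter of the kind mentioned in the abstract could then be applied to the ranked list, for example by dropping windows that fall in regions flagged as inaccessible.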


Pages: 372-81

Citations: 55

Space-conserving optimal DNA-protein alignment.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01 DOI: 10.1109/csb.2004.1332420

Pang Ko, Mahesh Narayanan, Anantharaman Kalyanaraman, Srinivas Aluru

DNA-protein alignment algorithms can be used to discover coding sequences in a genomic sequence, if the corresponding protein derivatives are known. They can also be used to identify potential coding sequences of a newly sequenced genome, by using proteins from related species. Previously known algorithms either solve a simplified formulation, or sacrifice optimality to achieve practical implementation. In this paper, we present a comprehensive formulation of the DNA-protein alignment problem, and an algorithm to compute the optimal alignment in O(mn) time using only four tables of size (m + 1) x (n + 1), where m and n are the lengths of the DNA and protein sequences, respectively. We also developed a Protein and DNA Alignment program PanDA that implements the proposed solution. Experimental results indicate that our algorithm produces high quality alignments.
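The O(mn) dynamic-programming structure can be illustrated with a deliberately simplified formulation. The sketch below uses a single score table and three moves (align a codon with an amino acid, skip one nucleotide, skip one amino acid); it is not the four-table PanDA algorithm, and the scoring parameters are arbitrary assumptions. The codon table is the standard genetic code.

```python
from itertools import product

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

def align_score(dna, protein, match=2, mismatch=-1, nt_gap=-2, aa_gap=-3):
    """Best global score for aligning dna against protein (simplified model)."""
    m, n = len(dna), len(protein)
    neg_inf = float("-inf")
    score = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    score[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i >= 3 and j >= 1:  # codon dna[i-3:i] aligned with protein[j-1]
                aa = CODON_TABLE.get(dna[i - 3:i])
                step = match if aa == protein[j - 1] else mismatch
                best = max(best, score[i - 3][j - 1] + step)
            if i >= 1:  # skip one nucleotide (e.g. a frameshift)
                best = max(best, score[i - 1][j] + nt_gap)
            if j >= 1:  # skip one amino acid
                best = max(best, score[i][j - 1] + aa_gap)
            score[i][j] = best
    return score[m][n]
```

Each cell is filled in constant time from at most three predecessors, which gives the O(mn) running time; recovering the alignment itself would additionally require a traceback.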


Citations: 0

Selection of patient samples and genes for outcome prediction.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01 DOI: 10.1109/csb.2004.1332451

Huiqing Liu, Jinyan Li, Limsoon Wong

Gene expression profiles with clinical outcome data enable monitoring of disease progression and prediction of patient survival at the molecular level. We present a new computational method for outcome prediction. Our idea is to use an informative subset of original training samples. This subset consists of only short-term survivors who died within a short period and long-term survivors who were still alive after a long follow-up time. These extreme training samples yield a clear platform to identify genes whose expression is related to survival. To find relevant genes, we combine two feature selection methods -- entropy measure and Wilcoxon rank sum test -- so that a set of sharp discriminating features are identified. The selected training samples and genes are then integrated by a support vector machine to build a prediction model, by which each validation sample is assigned a survival/relapse risk score for drawing Kaplan-Meier survival curves. We apply this method to two data sets: diffuse large-B-cell lymphoma (DLBCL) and primary lung adenocarcinoma. In both cases, patients in high and low risk groups stratified by our risk scores are clearly distinguishable. We also compare our risk scores to some clinical factors, such as International Prognostic Index score for DLBCL analysis and tumor stage information for lung adenocarcinoma. Our results indicate that gene expression profiles combined with carefully chosen learning algorithms can predict patient survival for certain diseases.
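The sample-selection step can be sketched as a simple filter. The record layout, the cutoff values, and the reading of "long-term survivor" as any patient followed beyond the long cutoff are assumptions made for illustration; the abstract does not state the thresholds used.

```python
def select_extreme_samples(samples, short_cutoff, long_cutoff):
    """Split training samples into short-term and long-term survivors.

    samples: list of (sample_id, follow_up_years, died) tuples.
    Short-term: died before short_cutoff.
    Long-term: followed beyond long_cutoff (assumption: such a patient
    was alive at the cutoff, whatever happened later).
    Everything else is left out of training.
    """
    short_term = [s for s in samples if s[2] and s[1] < short_cutoff]
    long_term = [s for s in samples if s[1] > long_cutoff]
    return short_term, long_term

# Hypothetical cohort: (id, follow-up in years, died during follow-up)
COHORT = [
    ("p1", 0.5, True),   # died early: short-term survivor
    ("p2", 0.8, False),  # censored early: uninformative, excluded
    ("p3", 3.0, True),   # intermediate: excluded
    ("p4", 6.0, False),  # alive after long follow-up: long-term survivor
    ("p5", 9.0, True),   # died, but only after the long cutoff
]
```

The two returned groups would then serve as the two classes for gene selection and for training the classifier, while the excluded intermediate samples play no role in training.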


Pages: 382-92

Citations: 0

Dynamic algorithm for inferring qualitative models of gene regulatory networks.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01 DOI: 10.1109/csb.2004.1332448

Zheng Yun, Kwoh Chee Keong

It is still an open problem to identify functional relations in o(N · n^k) time for any domain [2], where N is the number of learning instances, n is the number of genes (or variables) in the Gene Regulatory Network (GRN) models and k is the indegree of the genes. To solve the problem, we introduce a novel algorithm, DFL (Discrete Function Learning), for reconstructing qualitative models of GRNs from gene expression data. We analyze its average-case complexity of O(k · N · n^2) and its data requirements. We also perform experiments on both synthetic data and the yeast cell cycle gene expression data of Cho et al. [7] to validate the efficiency and prediction performance of the DFL algorithm. The experiments on synthetic Boolean networks show that the DFL algorithm is more efficient than current algorithms without loss of prediction performance. The results on the yeast cell cycle gene expression data show that the DFL algorithm can identify biologically significant models with reasonable accuracy, sensitivity and high precision with respect to evidence in the literature. We further introduce a method called the epsilon function to deal with noise in data sets. The experimental results show that the epsilon function method is a good supplement to the DFL algorithm.
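To see where the N · n^k cost in the problem statement comes from, consider the naive exhaustive baseline: try every set of at most k candidate regulators and check, across all N instances, whether the target is a consistent function of that set. The sketch below implements only this brute-force baseline, not the DFL algorithm; the function names and data layout are assumptions.

```python
from itertools import combinations

def is_function_of(inputs, states, outputs):
    """True if outputs are determined by the chosen input columns."""
    seen = {}
    for state, out in zip(states, outputs):
        key = tuple(state[i] for i in inputs)
        # the same input pattern must always map to the same output
        if seen.setdefault(key, out) != out:
            return False
    return True

def find_regulators(states, outputs, k):
    """Smallest input set of size <= k that explains the outputs.

    states: list of N tuples, each holding n discrete gene values.
    outputs: list of N target values (e.g. the gene at the next time step).
    Examines on the order of n^k subsets, each checked in O(N) time.
    """
    n = len(states[0])
    for size in range(1, k + 1):
        for inputs in combinations(range(n), size):
            if is_function_of(inputs, states, outputs):
                return inputs
    return None
```

The number of subsets grows as n^k, which is why an algorithm with a lower average-case cost matters for networks with many genes.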


Pages: 353-62

Citations: 0

Weighting features to recognize 3D patterns of electron density in X-ray protein crystallography.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01 DOI: 10.1109/csb.2004.1332439

Kreshna Gopal, Tod D Romo, James C Sacchettini, Thomas R Ioerger

Feature selection and weighting are central problems in pattern recognition and instance-based learning. In this work, we discuss the challenges of constructing and weighting features to recognize 3D patterns of electron density to determine protein structures. We present SLIDER, a feature-weighting algorithm that adjusts weights iteratively such that patterns that match query instances are ranked better than mismatching ones. Moreover, SLIDER makes judicious choices of the weight values to be considered in each iteration, by examining the specific weights at which matching and mismatching patterns switch as nearest neighbors to query instances. This approach reduces the space of weight vectors to be searched. We make the following two main observations: (1) SLIDER efficiently generates weights that contribute significantly to the retrieval of matching electron density patterns; (2) the optimum weight vector is sensitive to the distance metric, i.e., feature relevance can be, to a certain extent, sensitive to the underlying metric used to compare patterns.
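The idea of examining only the weights at which a match and a mismatch swap rank can be illustrated with a two-feature toy. The blended distance w · d1 + (1 − w) · d2, the function names, and the data layout are assumptions made for this sketch; it is not the SLIDER algorithm itself.

```python
def crossover_weight(match, mismatch):
    """Weight w in [0, 1] at which match and mismatch are equidistant.

    match, mismatch: (d1, d2) per-feature distances to the query.
    Returns None if the two never swap rank inside [0, 1].
    """
    a1, a2 = match
    b1, b2 = mismatch
    denom = (a1 - b1) - (a2 - b2)
    if denom == 0:
        return None
    w = (b2 - a2) / denom
    return w if 0.0 <= w <= 1.0 else None

def blended(distances, w):
    """Blended distance for one pattern under weight w."""
    return w * distances[0] + (1.0 - w) * distances[1]

def best_weight(pairs):
    """Pick the weight under which most matches beat their mismatches.

    Rankings can only change at crossover weights, so it is enough to
    test one weight inside each interval between consecutive crossovers.
    """
    crossings = set()
    for match, mismatch in pairs:
        w = crossover_weight(match, mismatch)
        if w is not None:
            crossings.add(w)
    edges = [0.0] + sorted(crossings) + [1.0]
    candidates = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]

    def wins(w):
        return sum(blended(m, w) < blended(mm, w) for m, mm in pairs)

    return max(candidates, key=wins)
```

Restricting the search to one candidate per interval is what shrinks the weight space: between two consecutive crossovers, the ranking of every match against its mismatch is constant.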


Pages: 255-65

Citations: 3

Profile-based string kernels for remote homology detection and motif extraction.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01 DOI: 10.1109/csb.2004.1332428

Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund, Christina Leslie

We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By using an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs" -- short regions of the original profile that contribute almost all the weight of the SVM classification score -- and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results are comparable to cluster kernels while providing much better scalability to large datasets.
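As a point of reference for the k-mer feature space, the sketch below computes the plain k-spectrum kernel, in which two sequences are compared by the inner product of their exact k-mer counts. This is the exact-matching baseline only; the profile kernel described above instead lets each position contribute to every k-mer in its profile-defined mutation neighborhood, which the sketch does not attempt. Function names are illustrative.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every contiguous k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_kernel(seq_a, seq_b, k):
    """Inner product of the two k-mer count vectors."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for k-mers that do not occur in seq_b
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())
```

The resulting kernel values can be assembled into a Gram matrix and passed to any SVM implementation that accepts precomputed kernels.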


Pages: 152-60

Citations: 0

An algorithm for detecting homologues of known structured RNAs in genomes.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2004-01-01 DOI: 10.1109/csb.2004.1332443

Shu-Yun Le, Jacob V Maizel, Kaizhong Zhang

Distinct RNA structures are frequently involved in a wide range of functions in various biological mechanisms. The three-dimensional RNA structures solved by X-ray crystallography and various well-established RNA phylogenetic structures indicate that functional RNAs have characteristic RNA structural motifs represented by specific combinations of base pairings and conserved nucleotides in the loop region. Discovery of well-ordered RNA structures and their homologues in genome-wide searches will enhance our ability to detect the RNA structural motifs and help us to highlight their association with functional and regulatory RNA elements. We present here a novel computer algorithm, HomoStRscan, that takes a single RNA sequence with its secondary structure to search for homologous RNAs in complete genomes. This novel algorithm completely differs from other currently used search algorithms for homologous structures or structural motifs. For an arbitrary segment (or window) in the target sequence that is similar in size to the query sequence, HomoStRscan finds the structure most similar to the input query structure and computes the maximal similarity score (MSS) between the two structures. The homologous RNA structures are then statistically inferred from the MSS distribution computed in the target genome. The method provides a flexible, robust and fine search tool for any homologous structural RNAs.
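The scan-then-infer workflow can be sketched as a sliding window followed by a z-score filter over the window scores. Both the similarity function (a plain count of identical positions, with no structural component) and the z-score criterion are placeholder assumptions; the abstract does not specify how the MSS is computed or which statistic is used.

```python
from statistics import mean, pstdev

def identity(window, query):
    """Placeholder similarity: number of identical positions."""
    return sum(a == b for a, b in zip(window, query))

def window_scores(target, query, similarity=identity):
    """Score every window of len(query) along the target sequence."""
    width = len(query)
    return [similarity(target[i:i + width], query)
            for i in range(len(target) - width + 1)]

def significant_windows(scores, z_cutoff):
    """Return (start_index, z_score) for windows that stand out from
    the score distribution of the whole target."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return []
    hits = []
    for i, s in enumerate(scores):
        z = (s - mu) / sigma
        if z >= z_cutoff:
            hits.append((i, z))
    return hits
```

A structure-aware score would replace `identity` without changing the rest of the pipeline, since the significance step only depends on the distribution of window scores.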



Citations: 9
