Proceedings of the ... Asia-Pacific bioinformatics conference: Latest Articles

Trends in Codon and Amino Acid Usage in Human Pathogen Tropheryma Whipplei, the only Known Actinobacteria with Reduced Genome
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0017
Sabyasachi Das, Sandip Paul, C. Dutta
The factors governing codon and amino acid usages in the predicted protein-coding sequences of Tropheryma whipplei TW08/27 and Twist genomes have been analyzed. Multivariate analysis identifies the replicational-transcriptional selection coupled with DNA strand-specific asymmetric mutational bias as a major driving force behind the significant inter-strand variations in synonymous codon usage patterns in T. whipplei genes, while a residual intra-strand synonymous codon bias is imparted by a selection force operating at the level of translation. The strand-specific mutational pressure has little influence on the amino acid usage, for which the mean hydropathy level and aromaticity are the major sources of variation, both having nearly equal impact. In spite of the intracellular life-style, the amino acid usage in highly expressed gene products of T. whipplei follows the cost-minimization hypothesis. Both the genomes under study are characterized by the presence of two distinct groups of membrane-associated genes, products of which exhibit significant differences in primary and potential secondary structures as well as in the propensity of protein disorder.
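The two amino acid usage measures named above can be sketched as follows. This is a minimal illustration only: it assumes the Kyte-Doolittle hydropathy scale and defines aromaticity as the fraction of Phe, Trp, and Tyr residues, neither of which is confirmed by the abstract. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the abstract does not name one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = set("FWY")  # assumed definition of aromatic residues


def mean_hydropathy(protein):
    """Average hydropathy value over all residues of a protein sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in protein) / len(protein)


def aromaticity(protein):
    """Fraction of residues that are Phe, Trp, or Tyr."""
    return sum(aa in AROMATIC for aa in protein) / len(protein)


if __name__ == "__main__":
    example = "MKTFFWYLLA"  # made-up sequence, for illustration only
    print(mean_hydropathy(example), aromaticity(example))
```

Computing both values for every predicted protein gives two per-gene variables that could then be compared against the axes of a multivariate analysis.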
Citations: 0
Discriminative Detection of Cis-Acting Regulatory Variation From Location Data
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0012
Yu Kawada, Y. Sakakibara
The interaction between transcription factors and their DNA binding sites is key to understanding gene regulation mechanisms. Recent studies revealed the presence of "functional polymorphism" in genes, defined as regulatory variation measured in transcription levels due to cis-acting sequence differences. These regulatory variants are assumed to contribute to modulating gene functions. However, computational identification of such functional cis-regulatory variants is a much greater challenge than just identifying consensus sequences, because cis-regulatory variants differ by only a few bases from the main consensus sequences, while they have important consequences for organismal phenotype. None of the previous studies has directly addressed this problem. We propose a novel discriminative detection method for precisely identifying transcription factor binding sites and their functional variants from both positive and negative samples (sets of upstream sequences of genes bound and not bound by a transcription factor) based on genome-wide location data. Our goal is to find discriminative substrings that best explain the location data, in the sense that the substrings precisely discriminate the positive samples from the negative ones, rather than finding substrings that are simply over-represented among the positive ones. Our method consists of two steps. First, we apply a decision tree learning method to discover discriminative substrings and a hierarchical relationship among them. Second, we extract a main motif and a second motif, as a cis-regulatory variant, by utilizing functional annotations. Our genome-wide experimental results on the yeast Saccharomyces cerevisiae show that our method performs significantly better at detecting experimentally verified consensus sequences than current motif detection methods. In addition, our method has successfully discovered second motifs of putative functional cis-regulatory variants that are associated with genes of different functional annotations, and the correctness of those variants has been verified by expression profile analyses.
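The difference between a discriminative substring and a merely over-represented one can be illustrated with the split criterion a decision tree typically uses. The sketch below scores each fixed-length substring by information gain over the bound (positive) and unbound (negative) sequence sets. It is an assumption-laden illustration: the abstract does not state which split criterion or substring lengths the authors use, and all function names are hypothetical.

```python
import math


def entropy(pos, neg):
    """Shannon entropy (bits) of a two-class count."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(substring, positives, negatives):
    """Entropy reduction from splitting sequences on presence of `substring`."""
    p_with = sum(substring in s for s in positives)
    n_with = sum(substring in s for s in negatives)
    p_without = len(positives) - p_with
    n_without = len(negatives) - n_with
    total = len(positives) + len(negatives)
    before = entropy(len(positives), len(negatives))
    after = ((p_with + n_with) / total) * entropy(p_with, n_with) \
        + ((p_without + n_without) / total) * entropy(p_without, n_without)
    return before - after


def best_substring(positives, negatives, k):
    """Root split of a decision stump: the k-mer with the highest gain."""
    candidates = {s[i:i + k] for s in positives for i in range(len(s) - k + 1)}
    return max(sorted(candidates),
               key=lambda w: information_gain(w, positives, negatives))


if __name__ == "__main__":
    bound = ["AACGTGA", "TCGTGCC"]      # toy upstream sequences (made up)
    unbound = ["AAAAAAA", "TTCCTTC"]
    print(best_substring(bound, unbound, 4))
```

Applying the same selection recursively to the sequences on each side of the split would yield the hierarchy of substrings that the first step of the method describes.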
Citations: 1
A Randomized Algorithm for Learning Mahalanobis Metrics: Application to Classification and Regression of Biological Data
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0025
C. Langmead
We present a randomized algorithm for semi-supervised learning of Mahalanobis metrics over $\mathbb{R}^n$. The inputs to the algorithm are a set, $U$, of unlabeled points in $\mathbb{R}^n$; a set of pairs of points, $S = \{(x, y)_i\}$ with $x, y \in U$, that are known to be similar; and a set of pairs of points, $D = \{(x, y)_i\}$ with $x, y \in U$, that are known to be dissimilar. The algorithm randomly samples $S$, $D$, and $m$-dimensional subspaces of $\mathbb{R}^n$ and learns a metric for each subspace. The metric over $\mathbb{R}^n$ is a linear combination of the subspace metrics. The randomization addresses issues of efficiency and overfitting. Extensions of the algorithm to learning non-linear metrics via kernels, and as a pre-processing step for dimensionality reduction, are discussed. The new method is demonstrated on a regression problem (structure-based chemical shift prediction) and a classification problem (predicting clinical outcomes for immunomodulatory strategies for treating severe sepsis).
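How a full-space metric can be assembled from subspace metrics is sketched below. The sketch makes two loud assumptions: subspaces are axis-aligned (chosen by coordinate index), and each subspace metric is already given as a small positive semi-definite matrix. The learning step itself is not shown, and all names are hypothetical rather than taken from the paper.

```python
def embed(sub_metric, dims, n):
    """Place an m x m subspace metric into an n x n matrix at coordinates `dims`."""
    full = [[0.0] * n for _ in range(n)]
    for a, i in enumerate(dims):
        for b, j in enumerate(dims):
            full[i][j] = sub_metric[a][b]
    return full


def combine(sub_metrics, weights, n):
    """Linear combination of embedded subspace metrics."""
    total = [[0.0] * n for _ in range(n)]
    for (dims, matrix), weight in zip(sub_metrics, weights):
        embedded = embed(matrix, dims, n)
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * embedded[i][j]
    return total


def mahalanobis(x, y, metric):
    """Distance sqrt((x - y)^T M (x - y)) for a positive semi-definite M."""
    d = [a - b for a, b in zip(x, y)]
    size = len(d)
    quad = sum(d[i] * metric[i][j] * d[j] for i in range(size) for j in range(size))
    return quad ** 0.5


if __name__ == "__main__":
    # Two hypothetical 2-D subspace metrics over a 3-D space.
    subs = [((0, 1), [[1.0, 0.0], [0.0, 1.0]]),
            ((1, 2), [[2.0, 0.0], [0.0, 2.0]])]
    M = combine(subs, [0.5, 0.5], 3)
    print(mahalanobis((1.0, 1.0, 1.0), (0.0, 0.0, 0.0), M))
```

Because each subspace metric only touches m coordinates, the cost of learning grows with m rather than n, which is one plausible reading of the efficiency claim in the abstract.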
Citations: 5
A More Accurate and Efficient Whole Genome Phylogeny
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0037
P. Chan, T. Lam, S. Yiu
To reconstruct a phylogeny for a given set of species, most previous approaches rely on similarity information derived from a subset of conserved regions (or genes) in the corresponding genomes. In some cases, the regions chosen may not reflect the evolutionary history of the species and may be too restricted to differentiate the species. It is generally believed that the inference could be more accurate if whole genomes are considered. The best existing solution that makes use of complete genomes was proposed by Henz et al. [13]. They can construct a phylogeny for 91 prokaryotic genomes in 170 CPU hours with an accuracy of about 70% (based on the measurement of non-trivial splits), while other approaches that use whole genomes can deal with no more than 20 species. Note that Henz et al. measure the distance between species using BLASTN, which is not primarily designed for whole genome alignment. Also, their approach is not scalable; for example, it probably takes over 1000 CPU hours to construct a phylogeny for all 230 prokaryotic genomes published by NCBI. In addition, we found that the non-trivial splits measure is only a rough indicator of the accuracy of the phylogeny. In this paper, we propose the following. (1) To evaluate the quality of a phylogeny with respect to a model answer, we suggest using the concept of the maximum agreement subtree, as it can capture the structure of the phylogeny. (2) We propose to use whole genome alignment software (such as MUMmer) to measure the distances between the species and derive an efficient approach to generate these distances. From experiments on real data sets, we found that our approach is more accurate and more scalable than the approach of Henz et al. We can construct a phylogenetic tree for the same set of 91 genomes with an accuracy more than 20% higher (with respect to both evaluation measures) in 2 CPU hours (more than 80 times faster than their approach).
Also, our approach is scalable and can construct a phylogeny for 230 prokaryotic genomes with accuracy as high as 85% in only 9.5 CPU hours.
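The split-based accuracy measure mentioned above can be sketched as the fraction of non-trivial splits of a reference tree that also appear in the inferred tree. This is an assumed reading of the measure, not the exact definition used by Henz et al. or by the authors. Trees are written as nested tuples of leaf names, both trees are assumed to share the same leaf set, and the function names are hypothetical.

```python
def leaves(tree):
    """Set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def nontrivial_splits(tree):
    """Bipartitions with at least two leaves on each side.

    Each split is stored as the side that does not contain a fixed anchor
    leaf, so the same bipartition always has the same representation.
    """
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        if 2 <= len(clade) <= len(all_leaves) - 2:
            splits.add(all_leaves - clade if anchor in clade else clade)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def split_accuracy(inferred, reference):
    """Fraction of reference splits recovered (reference must have at least one)."""
    ref = nontrivial_splits(reference)
    return len(nontrivial_splits(inferred) & ref) / len(ref)


if __name__ == "__main__":
    reference = ((("A", "B"), "C"), ("D", "E"))
    inferred = ((("A", "C"), "B"), ("D", "E"))
    print(split_accuracy(inferred, reference))
```

A score of this kind counts each split independently, so two trees can share many splits while still differing in overall shape, which is consistent with the abstract's remark that it is only a rough indicator.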
Citations: 6
ONBRIRES: Ontology-Based Biological Relation Extraction System
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0036
Minlie Huang, Xiaoyan Zhu, Shilin Ding, Hao Yu, Ming Li
Automated discovery and extraction of biological relations from online documents, particularly MEDLINE texts, has become essential and urgent because such literature is accumulating at a tremendous rate. In this paper, we present an ontology-based framework for a biological relation extraction system. The framework is unified and able to extract several kinds of relations, such as gene-disease, gene-gene, and protein-protein interactions. The main contributions of this paper are that we propose a two-level pattern learning algorithm and organize patterns hierarchically.
Citations: 20
Analyzing Inconsistency Toward Enhancing Integration of Biological Molecular Databases
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0023
Y. Chen, Qingfeng Chen
The rapid growth of biological databases not only provides biologists with abundant data but also presents a big challenge for data analysis. Many data analysis approaches, such as data mining, information retrieval, and machine learning, have been used to extract frequent patterns from diverse biological databases. However, discrepancies due to differences in the structure of databases and their terminologies result in a significant lack of interoperability. Although ontology-based approaches have been used to integrate biological databases, the analysis of inconsistency between biological databases has been largely disregarded. This paper presents a method for measuring the degree of inconsistency between biological databases. It not only presents a guideline for correct and efficient database integration, but also exposes high-quality data for data mining and knowledge discovery.
Citations: 5
Accuracy of Four Heuristics for the Full Sibship Reconstruction Problem in the Presence of Genotype Errors
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0004
D. Konovalov
The full sibship reconstruction (FSR) problem is the problem of inferring all groups of full siblings from a given population sample using genetic marker data without parental information. The FSR problem remains a significant challenge for computational biology, since an exact solution for the problem has not been found. The new algorithm, named SIMPSON-assisted Descending Ratio (SDR), combines a new Simpson-index-based O(n^2) algorithm (MS2) with the existing Descending Ratio (DR) algorithm. The SDR algorithm outperforms the SIMPSON, MS2, and DR algorithms in accuracy and robustness when tested on a variety of sample family structures. The accuracy error is measured as the percentage of incorrectly assigned individuals. The robustness of the FSR algorithms is assessed by simulating a 2% mutation rate per locus (a 1% rate per allele).
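Two quantities in this abstract lend themselves to a short worked example: the relation between the per-allele and per-locus error rates, and the Simpson index of a partition into sibling groups. The sketch assumes diploid loci with independent errors on each allele, and uses the standard Simpson index (the probability that two individuals drawn without replacement fall in the same group). How MS2 actually uses the index is not stated in the abstract, and the function names are hypothetical.

```python
def per_locus_error_rate(per_allele_rate, ploidy=2):
    """Probability that at least one allele at a locus is altered,
    assuming independent errors on each allele copy."""
    return 1.0 - (1.0 - per_allele_rate) ** ploidy


def simpson_index(group_sizes):
    """Probability that two individuals sampled without replacement
    belong to the same group of the partition."""
    n = sum(group_sizes)
    return sum(size * (size - 1) for size in group_sizes) / (n * (n - 1))


if __name__ == "__main__":
    print(per_locus_error_rate(0.01))   # close to 0.02, matching the 2% per locus
    print(simpson_index([4, 3, 3]))     # a hypothetical three-family sample
```

Under these assumptions a 1% rate per allele gives 1 - 0.99^2 = 0.0199 per locus, which rounds to the 2% quoted in the abstract.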
Citations: 15
Predicting Ranked SCOP Domains by Mining Associations of Visual Contents in Distance Matrices
Pub Date : 2005-12-01 DOI: 10.1142/9781860947292_0008
Pin-Hao Chi, C. Shyu
Protein tertiary structures are known to have significant correlations with their biological functions. To organize information about protein structures, the Structural Classification of Proteins (SCOP) database, which is manually constructed by human experts, classifies similar protein folds into the same domain hierarchy. Even though this approach is believed to be more reliable than applying traditional alignment methods in structural classification, it is labor intensive. In this paper, we build a non-parametric classifier to predict possible SCOP domains for unknown protein structures. With supervised learning, the algorithm first maps the tertiary structures of training proteins into two-dimensional distance matrices, and then extracts signatures from the visual contents of the matrices. A knowledge discovery and data mining (KDD) process further discovers relevant patterns in the training signatures of each SCOP domain by mining association rules. Finally, the number of rules whose patterns match the signatures of an unknown protein determines the predicted domains in ranked order. We select 7,702 protein chains from 150 domains of the SCOP database, release 1.67, as labelled data and use 10-fold cross validation. Experimental results show that the prediction accuracy is 91.27% for the top-ranked domain and 99.22% for the top 5 ranked domains. The average response time is 6.34 seconds, so the method combines high prediction accuracy with reasonable efficiency.
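The first and last steps of this pipeline can be sketched briefly: turning 3-D residue coordinates into a two-dimensional distance matrix, and ranking domains by how many mined rules matched. The sketch assumes one coordinate per residue (for example the C-alpha atom) and takes the list of matched rules as given. Signature extraction and association rule mining are not shown, and the domain labels and function names are hypothetical.

```python
import math
from collections import Counter


def distance_matrix(coords):
    """Pairwise Euclidean distances between residue coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def rank_domains(matched_rule_domains, top_k=5):
    """Rank domains by the number of matching rules, most matches first.

    `matched_rule_domains` holds one domain label per rule whose pattern
    matched the query protein's signatures.
    """
    counts = Counter(matched_rule_domains)
    return [domain for domain, _ in counts.most_common(top_k)]


if __name__ == "__main__":
    coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]  # toy coordinates
    for row in distance_matrix(coords):
        print(row)
    print(rank_domains(["a.1", "b.2", "a.1", "c.3", "a.1", "b.2"], top_k=2))
```

The distance matrix is symmetric with a zero diagonal and does not change under rotation or translation of the structure, which is what makes it usable as an image-like representation of a fold.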
Citations: 2
Cells In Silico (CIS): A biomedical simulation framework based on Markov random field
Pub Date : 2005-01-01 DOI: 10.1142/9781860947322_0015
Kung-Hao Liang
This paper presents CIS, a biomedical simulation framework based on the Markov random field (MRF). CIS is a discrete-domain 2-D simulation framework that emphasizes the spatial interactions of biomedical entities. The probability model within the MRF framework facilitates the construction of more realistic models than deterministic differential-equation approaches and cellular automata. Global phenomena in CIS are dictated by local conditional probabilities. In addition, multiscale MRF is potentially useful for modelling complex biomedical phenomena at multiple spatial and time scales. The methodology and procedure of CIS for a biomedical simulation are presented using tumor-induced hypoxia and angiogenesis as an example scenario. The goal of this research is to unveil the complex appearances of biomedical phenomena using mathematical models, thus enhancing our understanding of the secrets of life. Computational cell biology is an emerging discipline in which biomedical simulations are employed to study cells and their microenvironments at various spatio-temporal scales. The E-Cell and Virtual Cell projects focus on the molecular and biochemical level within cells, addressing the dynamics of signal transduction, regulatory and metabolic networks. Sub-cellular compartmental models are constructed and integrated gradually so as to simulate a particular facet (or pathway) of cells. The Epitheliome project is an example of tissue-level simulation, aiming to depict epithelial cell growth and the social behavior of cells in culture. Simulations of higher-level systems include the Physiome and the modelling of organs such as the heart. Each scale of simulation sheds light on different aspects of life. Biomedical simulations have been conducted in both the continuous and discrete domains.
Differential equations are the key elements of continuous-domain simulation, where the concentrations of particular receptors, ligands, enzymes or metabolites are modelled at various spatial and temporal scales. This approach is limited by the fact that many biomedical phenomena are too complex to be described by sets of differential equations. In addition, deterministic differential equations are not adequate for describing the many biological phenomena that have a stochastic nature. Alternatively, discrete-domain simulations are processed on a spatio-temporal discrete lattice. The combination of the Potts model and the Metropolis algorithm has been used to simulate cell sorting, morphogenesis, the behavior of malignant tumors and the failure of Tamoxifen treatment of cancer.
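The discrete-lattice approach mentioned at the end of the abstract can be illustrated with a minimal sketch. This is not the CIS framework itself, whose details the abstract does not give. It is a plain Potts model updated with the Metropolis rule, where each site's acceptance probability depends only on its four nearest neighbours (a local conditional rule of the kind an MRF describes). All names (`potts_energy_at`, `metropolis_sweep`, `total_energy`), the lattice size, the number of states, the coupling and the temperature are assumptions chosen for illustration.

```python
import math
import random

def potts_energy_at(grid, r, c, state, coupling=1.0):
    """Local energy of `state` at (r, c): each unlike 4-neighbour costs `coupling`."""
    n_rows, n_cols = len(grid), len(grid[0])
    energy = 0.0
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = (r + dr) % n_rows, (c + dc) % n_cols  # periodic boundary (assumed)
        if grid[nr][nc] != state:
            energy += coupling
    return energy

def metropolis_sweep(grid, n_states, temperature, rng):
    """One sweep: propose a new state at as many random sites as the lattice has cells."""
    n_rows, n_cols = len(grid), len(grid[0])
    for _ in range(n_rows * n_cols):
        r, c = rng.randrange(n_rows), rng.randrange(n_cols)
        old = grid[r][c]
        new = rng.randrange(n_states)
        delta = potts_energy_at(grid, r, c, new) - potts_energy_at(grid, r, c, old)
        # Always accept downhill moves; accept uphill moves with probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            grid[r][c] = new

def total_energy(grid, coupling=1.0):
    """Count each unlike neighbour pair once (right and down neighbours only)."""
    n_rows, n_cols = len(grid), len(grid[0])
    energy = 0.0
    for r in range(n_rows):
        for c in range(n_cols):
            if grid[r][c] != grid[r][(c + 1) % n_cols]:
                energy += coupling
            if grid[r][c] != grid[(r + 1) % n_rows][c]:
                energy += coupling
    return energy

# Illustrative run: a random 16x16 lattice with 3 states, cooled at a low temperature.
rng = random.Random(0)
grid = [[rng.randrange(3) for _ in range(16)] for _ in range(16)]
before = total_energy(grid)
for _ in range(50):
    metropolis_sweep(grid, n_states=3, temperature=0.5, rng=rng)
after = total_energy(grid)
print(f"energy before: {before}, after: {after}")
```

At a low temperature, like-state sites tend to cluster, so the total boundary energy drops over successive sweeps. Cell-sorting simulations of the kind the abstract refers to typically add further energy terms (for example, cell-volume constraints), which this sketch omits.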
Citations: 0
Protein informatics towards integration of data grid and computing grid
Pub Date : 2005-01-01 DOI: 10.1142/9781860947322_0036
Haruki Nakamura
Information on the structures and functions of protein molecules, and on the mutual interactions that form protein networks, is increasing rapidly as a consequence of structural genomics and structural proteomics projects [1]. Advanced applications of such information require Grid technology to solve two problems: (i) the shortage of computational power, and (ii) the lack of a capability for seamlessly and quickly retrieving data from the variety of heterogeneous biological databases [2].
Citations: 3