Genome informatics. International Conference on Genome Informatics: latest articles
Comprehensive analysis of sequence-structure relationships in the loop regions of proteins.
Pub Date : 2009-10-01 DOI: 10.1142/9781848165632_0010
Shugo Nakamura, K. Shimizu
Local sequence-structure relationships in the loop regions of proteins were comprehensively estimated using simple prediction tools based on support vector regression (SVR). End-to-end distance was selected as a rough structural property of fragments, and the end-to-end distances of an enormous number of loop fragments from a wide variety of protein folds were directly predicted from sequence information by using SVR. We found that our method was more accurate than random prediction for predicting the structure of fragments comprising 5, 9, and 17 amino acids; moreover, the extended loop fragments could be successfully distinguished from turn structures on the basis of their sequences, which implies that the sequence-structure relationships were significant for loop fragments with a wide range of end-to-end distances. These results suggest that many loop regions as well as helices and strands restrict the conformational space of the entire tertiary structure of proteins to some extent; moreover, our findings throw light on the mechanism of protein folding and prediction of the tertiary structure of proteins without using structural templates.
Pages: 106-16
Citations: 0
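
The end-to-end distance used as the structural property in the abstract above can be illustrated with a minimal sketch. It assumes the distance is measured between the C-alpha atoms of the first and last residue of each fragment, and that coordinates are given as plain (x, y, z) tuples; the function name and the toy chain are hypothetical, and the SVR step itself is not shown.

```python
import math

def end_to_end_distances(ca_coords, length):
    """Distance between the first and last C-alpha atom of every
    contiguous fragment of `length` residues (sliding window)."""
    if length < 2 or length > len(ca_coords):
        return []
    return [
        math.dist(ca_coords[i], ca_coords[i + length - 1])
        for i in range(len(ca_coords) - length + 1)
    ]

# Toy chain (assumption): six residues spaced 3.8 angstroms apart on a line,
# i.e. a fully extended fragment.
chain = [(3.8 * i, 0.0, 0.0) for i in range(6)]
print(end_to_end_distances(chain, 5))
```

In a setting like the one described, such distances would serve as regression targets for fragments of 5, 9, and 17 residues, with the sequence encoding chosen separately.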
An assessment of prediction algorithms for nucleosome positioning.
Pub Date : 2009-10-01 DOI: 10.1142/9781848165632_0016
Yoshiaki Tanaka, K. Nakai
Nucleosome configuration in eukaryotic genomes is an important clue to clarify the mechanisms of regulation for various nuclear events. In the past few years, numerous computational tools have been developed for the prediction of nucleosome positioning, but there is no third-party benchmark of their performance. Here we present a performance evaluation using genome-scale in vivo nucleosome maps of two vertebrates and three invertebrates. In our measurement, two recently updated versions of Segal's model and Gupta's SVM with the RBF kernel, which was not implemented originally, showed higher prediction accuracy, although their performances differed significantly in the prediction of medaka fish and candida yeast. The cross-species prediction results using Gupta's SVM also suggested rather specific characteristics of nucleosomal DNAs in medaka and budding yeast. By analyzing the over- and under-representation of DNA oligomers, we found both general and species-specific motifs in nucleosomal and linker DNAs. The only oligomers commonly enriched in all five eukaryotes were CA/TG and AC/GT. Thus, to achieve relatively high performance for a species, it is desirable to prepare the training data from the same species.
Pages: 169-78
Citations: 16
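
The oligomer over- and under-representation analysis mentioned in the abstract above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' procedure: each k-mer is collapsed with its reverse complement (so CA and TG count as one motif), frequencies are compared between nucleosomal and linker sequences with a log2 ratio, and the pseudocount, function names, and toy sequences are all hypothetical.

```python
import math
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Collapse a k-mer with its reverse complement (e.g. TG -> CA)."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmer_frequencies(seqs, k=2):
    """Relative frequency of each canonical k-mer over all sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip ambiguous bases such as N
                counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

def log2_enrichment(nucleosomal, linker, k=2, pseudo=1e-6):
    """Positive values: enriched in nucleosomal DNA; negative: in linker DNA."""
    f_nuc = kmer_frequencies(nucleosomal, k)
    f_lnk = kmer_frequencies(linker, k)
    return {
        kmer: math.log2((f_nuc.get(kmer, 0.0) + pseudo) / (f_lnk.get(kmer, 0.0) + pseudo))
        for kmer in set(f_nuc) | set(f_lnk)
    }

# Toy sequences (assumption), chosen only to make the contrast obvious.
scores = log2_enrichment(["CACACA"], ["AAAAAA"])
for kmer, score in sorted(scores.items()):
    print(kmer, round(score, 2))
```

A real analysis would also need a significance test and a background model; the log ratio alone only ranks motifs.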
A new generation of homology search tools based on probabilistic inference.
Pub Date : 2009-10-01 DOI: 10.1142/9781848165632_0019
S. Eddy
Many theoretical advances have been made in applying probabilistic inference methods to improve the power of sequence homology searches, yet the BLAST suite of programs is still the workhorse for most of the field. The main reason for this is practical: BLAST's programs are about 100-fold faster than the fastest competing implementations of probabilistic inference methods. I describe recent work on the HMMER software suite for protein sequence analysis, which implements probabilistic inference using profile hidden Markov models. Our aim in HMMER3 is to achieve BLAST's speed while further improving the power of probabilistic inference based methods. HMMER3 implements a new probabilistic model of local sequence alignment and a new heuristic acceleration algorithm. Combined with efficient vector-parallel implementations on modern processors, these improvements synergize. HMMER3 uses more powerful log-odds likelihood scores (scores summed over alignment uncertainty, rather than scoring a single optimal alignment); it calculates accurate expectation values (E-values) for those scores without simulation using a generalization of Karlin/Altschul theory; it computes posterior distributions over the ensemble of possible alignments and returns posterior probabilities (confidences) in each aligned residue; and it does all this at an overall speed comparable to BLAST. The HMMER project aims to usher in a new generation of more powerful homology search tools based on probabilistic inference methods.
Pages: 205-11
Citations: 1178
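
The difference between scoring a single optimal alignment and summing over alignment uncertainty, as described in the abstract above, can be shown with a toy calculation. This is not HMMER code: it assumes the log-odds scores (in bits) of a handful of alternative alignments are already known, whereas real implementations obtain both quantities by dynamic programming over a profile HMM. The function names are hypothetical.

```python
import math

def optimal_alignment_score(alignment_bits):
    """Score of the single best alignment (Viterbi-style)."""
    return max(alignment_bits)

def summed_score(alignment_bits):
    """Log-odds score summed over all alignments (Forward-style).

    Odds ratios add across alignments, so the total in bits is
    log2(sum(2 ** s)); the maximum is factored out for numerical stability.
    """
    best = max(alignment_bits)
    return best + math.log2(sum(2.0 ** (s - best) for s in alignment_bits))

# Toy example (assumption): three alternative alignments of one sequence.
bits = [10.0, 10.0, 8.0]
print(optimal_alignment_score(bits), summed_score(bits))
```

The summed score is never lower than the optimal-alignment score, and the gap grows when many near-optimal alignments exist, which is the situation where a single best alignment is least trustworthy.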
Thinking laterally about genomes.
Mark A Ragan

Perhaps the most surprising discovery of the genome era has been the extent to which prokaryotic and many eukaryotic genomes incorporate genetic material from sources other than their parent(s). Lateral genetic transfer (LGT) among bacteria was first observed about 100 years ago, and is now accepted to underlie important phenomena including the spread of antibiotic resistance and the ability to degrade xenobiotics. LGT is invoked, perhaps too readily, to explain a breadth of awkward data including compositional heterogeneity of genomes, disagreement among gene-sequence trees, and mismatch between physiology and systematics. At the same time many details of LGT remain unknown or controversial, and some key questions have scarcely been asked. Here I critically review what we think we know about the existence, extent, mechanism and impact of LGT; identify important open questions; and point to research directions that hold particular promise for elucidating the role of LGT in genome evolution. Evidence for LGT in nature is not only inferential but also direct, and potential vectors are ubiquitous. Genetic material can pass between diverse habitats and be significantly altered during residency in viruses, complicating the inference of donors. In prokaryotes about twice as many genes are interrupted by LGT as are transferred intact, and about 5 [...]. Short protein domains can be privileged units of transfer. Unresolved phylogenetic issues include the correct null hypothesis, and genes as units of analysis. Themes are beginning to emerge regarding the effect of LGT on cellular networks, but I show why generalization is premature. LGT can associate with radical changes in physiology and ecological niche. Better quantitative models of genome evolution are needed, and theoretical frameworks remain to be developed for some observations including chromosome assembly by LGT.

Pub Date : 2009-10-01 Pages: 221-2
Citations: 0
The prediction of local modular structures in a co-expression network based on gene expression datasets.
Yoshiyuki Ogata, Nozomu Sakurai, Hideyuki Suzuki, Koh Aoki, Kazuki Saito, Daisuke Shibata

In scientific fields such as systems biology, the relationships between network members (vertices) are evaluated using a network structure. In a co-expression network, comprising genes (vertices) and gene-to-gene links (edges) representing co-expression relationships, local modular structures with tight intra-modular connections include genes that are co-expressed with each other. To detect such modules within the whole network, it is useful to evaluate the network topology between modules as well as the topology within them. We therefore combined a novel inter-modular index with network density, the representative intra-modular index, instead of using network density alone. We designed an algorithm to optimize the combined index for a module and applied it to Arabidopsis co-expression analysis. To verify the relation between the modules obtained using our algorithm and biological knowledge, we compared it with other tools for co-expression network analysis using the KEGG pathways; the network modules detected by our algorithm showed better associations with the pathways. The algorithm is also applicable to large datasets of gene expression profiles, which are difficult to process all at once.

Pub Date : 2009-10-01 Pages: 117-27
Citations: 0
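
Network density, the intra-modular index named in the abstract above, is the fraction of possible vertex pairs in a module that are actually connected. The sketch below assumes an undirected co-expression network given as a list of unique gene pairs without self-loops. The abstract does not define the novel inter-modular index, so `boundary_edge_count` is only a hypothetical stand-in that counts edges leaving the module.

```python
def module_density(edges, module):
    """Density = 2 * E_in / (n * (n - 1)) for the n genes in the module."""
    members = set(module)
    n = len(members)
    if n < 2:
        return 0.0
    internal = sum(1 for u, v in edges if u in members and v in members)
    return 2.0 * internal / (n * (n - 1))

def boundary_edge_count(edges, module):
    """Hypothetical stand-in for an inter-modular measure:
    number of edges with exactly one endpoint inside the module."""
    members = set(module)
    return sum(1 for u, v in edges if (u in members) != (v in members))

# Toy network (assumption): a fully connected triangle plus one outside link.
edges = [("g1", "g2"), ("g2", "g3"), ("g1", "g3"), ("g3", "g4")]
module = ["g1", "g2", "g3"]
print(module_density(edges, module), boundary_edge_count(edges, module))
```

A module search along the lines described would look for gene sets where the first value is high and the connection to the rest of the network is comparatively weak.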
A method for efficient execution of bioinformatics workflows.
Junya Seo, Yoshiyuki Kido, Shigeto Seno, Yoichi Takenaka, Hideo Matsuda

Efficient execution of data-intensive workflows has been playing an important role in bioinformatics as the amount of data has been rapidly increasing. The execution of such workflows must take into account the volume and pattern of communication. When orchestrating data-centric workflows, a centralized workflow engine can become a bottleneck to performance. To cope with the bottleneck, a hybrid approach with choreography for data management of workflows is proposed. However, when a workflow includes many repetitive operations, the approach might not gain good performance because of the overheads of its additional mechanism. This paper presents and evaluates an improvement of the hybrid approach for managing a large amount of data. The performance of the proposed method is demonstrated by measuring execution times of example workflows.

Pub Date : 2009-10-01 Pages: 139-48
Citations: 0
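
Why a centralized engine can become a bottleneck, as the abstract above notes, can be seen from a back-of-the-envelope model. The sketch assumes a strictly linear workflow in which each step produces one output of known size; under orchestration every output is returned to the engine and all but the last are forwarded to the next service, while under choreography services pass data directly and only the final result reaches the engine. The model and function names are hypothetical and ignore control messages and the overheads the paper discusses.

```python
def engine_traffic_orchestrated(output_sizes):
    """Data volume passing through a central engine: every output comes in,
    and every output except the last is sent out again to the next step."""
    return sum(output_sizes) + sum(output_sizes[:-1])

def engine_traffic_choreographed(output_sizes):
    """Data volume at the engine when services exchange data directly:
    only the final result is returned."""
    return output_sizes[-1] if output_sizes else 0

# Toy workflow (assumption): output sizes in megabytes for three steps.
sizes_mb = [100, 50, 1]
print(engine_traffic_orchestrated(sizes_mb), engine_traffic_choreographed(sizes_mb))
```

The gap widens with large intermediate results, which is the data-intensive case the abstract is concerned with.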
Quality control and reproducibility in DNA microarray experiments.
Pub Date : 2009-10-01 DOI: 10.1142/9781848165632_0003
André Fujita, J. Sato, Fernando H L DA Silva, Maria C Galvão, M. Sogayar, S. Miyano
Biological experiments are usually set up in technical replicates (duplicates or triplicates) in order to ensure reproducibility and to assess any significant error introduced during the experimental process. The first step in biological data analysis is to check the technical replicates and to confirm that the measurement error is small enough to be of no concern. However, little attention has been paid to this part of the analysis. Here, we propose a general process to estimate the measurement error and, consequently, to provide an interpretable and objective way to ensure the quality of the technical replicates. In particular, we illustrate its application to a DNA microarray dataset set up in technical duplicates.
Pages: 21-31
Citations: 2
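
One simple way to put a number on the measurement error of technical duplicates is the standard within-pair estimator, sketched below. This is a textbook quantity and not necessarily the process proposed in the paper: it assumes each probe is measured exactly twice, that the two measurements share the same error variance, and that values are on a scale (for example log intensity) where that assumption is reasonable. The function name and toy values are hypothetical.

```python
import math

def duplicate_measurement_error(pairs):
    """Standard deviation of a single measurement, estimated from duplicates.

    For a pair (a, b) with independent errors of variance s^2, the
    difference a - b has variance 2 * s^2, hence s^2 ~ sum(d^2) / (2 * n).
    """
    if not pairs:
        raise ValueError("at least one duplicate pair is required")
    sum_sq_diff = sum((a - b) ** 2 for a, b in pairs)
    return math.sqrt(sum_sq_diff / (2 * len(pairs)))

# Toy log-intensities (assumption) for four probes measured in duplicate.
pairs = [(8.1, 8.3), (10.0, 9.8), (6.5, 6.5), (12.2, 12.0)]
print(round(duplicate_measurement_error(pairs), 3))
```

Comparing this value with the spread of the biological signal gives a rough sense of whether the replicates are good enough to proceed.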
Cancer classification using single genes.
Pub Date : 2009-10-01 DOI: 10.1142/9781848165632_0017
Xiaosheng Wang, O. Gotoh
We present a method for the classification of cancer based on gene expression profiles using single genes. We select the genes with high class-discrimination capability according to the degree to which the classes depend on them. We then build classifiers based on the decision rules induced by the selected single genes. We test our single-gene classification method on three publicly available cancer gene expression datasets. In a majority of cases, we obtain relatively accurate classification outcomes using just one gene. Some genes highly correlated with the pathogenesis of cancer are identified. Our feature selection and classification approaches are both based on rough sets, a machine learning method. In comparison with other methods, our method is simple, effective and robust. We conclude that, if gene selection is implemented reasonably, accurate molecular classification of cancer can be achieved with very simple predictive models based on gene expression profiles.
Pages: 179-88
Citations: 27
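
The gene-selection criterion in the abstract above appears to correspond to the dependency degree from rough set theory: the fraction of samples whose class is determined unambiguously by a gene's value. The sketch below assumes expression values have already been discretized (for example into "high" and "low"); the discretization, the function name, and the toy data are hypothetical and may differ from the paper's setup.

```python
from collections import defaultdict

def dependency_degree(gene_values, labels):
    """Size of the positive region divided by the number of samples.

    A sample lies in the positive region when every sample sharing its
    (discretized) gene value also shares its class label.
    """
    classes_by_value = defaultdict(set)
    for value, label in zip(gene_values, labels):
        classes_by_value[value].add(label)
    consistent = sum(1 for value in gene_values if len(classes_by_value[value]) == 1)
    return consistent / len(gene_values)

# Toy data (assumption): four samples, two candidate genes.
labels = ["tumor", "tumor", "normal", "normal"]
gene_a = ["high", "high", "low", "low"]   # separates the classes perfectly
gene_b = ["high", "low", "low", "low"]    # "low" is ambiguous
print(dependency_degree(gene_a, labels), dependency_degree(gene_b, labels))
```

A gene with a dependency degree near 1 yields decision rules of the form "if the gene is high, then tumor", which is what makes a single-gene classifier possible.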
Cancer classification using single genes. 使用单基因进行癌症分类。
Xiaosheng Wang, Osamu Gotoh

We present a method for She classification of cancer based on gene expression profiles using single genes. We select the genes with high class-discrimination capability according to their depended degree by the classes. We then build classifiers based on the decision rules induced by single genes selected. We test our single-gene classification method on three publicly available cancerous gene expression datasets. In a majority of cases, we gain relatively accurate classification outcomes by just utilizing one gene. Some genes highly correlated with the pathogenesis of cancer are identified. Our feature selection and classification approaches are both based on rough sets, a machine learning method. In comparison with other methods, our method is simple, effective and robust. We conclude that, if gene selection is implemented reasonably, accurate molecular classification of cancer can be achieved with very simple predictive models based on gene expression profiles.

我们提出了一种基于单个基因表达谱的癌症She分类方法。根据类对基因的依赖程度选择具有高类区分能力的基因。然后,我们根据所选择的单个基因诱导的决策规则构建分类器。我们在三个公开的癌症基因表达数据集上测试了我们的单基因分类方法。在大多数情况下,我们只需利用一个基因就可以获得相对准确的分类结果。发现了一些与癌症发病机制高度相关的基因。我们的特征选择和分类方法都是基于粗糙集,一种机器学习方法。与其他方法相比,该方法简单、有效、鲁棒性好。我们的结论是,如果合理地实施基因选择,基于基因表达谱的非常简单的预测模型就可以实现准确的癌症分子分类。
{"title":"Cancer classification using single genes.","authors":"Xiaosheng Wang,&nbsp;Osamu Gotoh","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>We present a method for She classification of cancer based on gene expression profiles using single genes. We select the genes with high class-discrimination capability according to their depended degree by the classes. We then build classifiers based on the decision rules induced by single genes selected. We test our single-gene classification method on three publicly available cancerous gene expression datasets. In a majority of cases, we gain relatively accurate classification outcomes by just utilizing one gene. Some genes highly correlated with the pathogenesis of cancer are identified. Our feature selection and classification approaches are both based on rough sets, a machine learning method. In comparison with other methods, our method is simple, effective and robust. We conclude that, if gene selection is implemented reasonably, accurate molecular classification of cancer can be achieved with very simple predictive models based on gene expression profiles.</p>","PeriodicalId":73143,"journal":{"name":"Genome informatics. International Conference on Genome Informatics","volume":"23 1","pages":"179-88"},"PeriodicalIF":0.0,"publicationDate":"2009-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"28733800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Localized suffix array and its application to genome mapping problems for paired-end short reads.
Kouichi Kimura, Asako Koike

We introduce a new data structure, the localized suffix array, in which occurrence information in text search applications is dynamically represented as a combination of global positional information and local lexicographic order information. In a search for a pair of words within a given distance, many candidate positions that share a coarse-grained global position can be compactly represented in terms of local lexicographic order, as in the conventional suffix array, and they can be examined simultaneously for violation of the distance constraint at the coarse-grained resolution. The trade-off between positional and lexicographic information is progressively shifted towards finer positional resolution, and the distance constraint is re-examined accordingly. The paired search can thus be performed efficiently even if each word has a large number of occurrences. The localized suffix array is in fact a reordering of the bits inside the conventional suffix array, so their memory requirements are essentially the same. We demonstrate an application to genome mapping problems for paired-end short reads generated by new-generation DNA sequencers. When paired reads are highly repetitive, it is time-consuming to naïvely calculate, sort, and compare all of the coordinates. For human genome re-sequencing data with 36-base-pair reads, speedups of more than 10 times over the naïve method were observed in almost half of the cases where the sum of redundancies (numbers of individual occurrences) of the paired reads was greater than 2,000.
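The naïve baseline mentioned in the abstract, and the general idea of rejecting candidates at a coarse positional resolution first, can be sketched as follows. This is not the localized suffix array itself: it uses a conventional suffix array built naively, plus a simple block-bucket filter standing in for the coarse-grained distance check. The function names, the block size, and the toy text are all assumptions chosen for illustration.

```python
from collections import defaultdict


def build_suffix_array(text):
    """Naive construction by sorting suffixes; fine for a short toy text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, sa, word):
    """Sorted start positions of `word`, found by binary search on `sa`."""
    k = len(word)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose k-prefix is >= word
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < word:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose k-prefix is > word
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= word:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def naive_pairs(pos1, pos2, max_dist):
    """Baseline: compare every coordinate of word 1 with every one of word 2."""
    return sorted((p, q) for p in pos1 for q in pos2 if 0 <= q - p <= max_dist)


def blocked_pairs(pos1, pos2, max_dist, block):
    """Coarse filter: bucket word-2 positions by block index and compare
    only against blocks that could still satisfy the distance constraint."""
    by_block = defaultdict(list)
    for q in pos2:
        by_block[q // block].append(q)
    span = max_dist // block + 1  # furthest block offset that can match
    out = []
    for p in pos1:
        b = p // block
        for nb in range(b, b + span + 1):
            for q in by_block.get(nb, ()):
                if 0 <= q - p <= max_dist:
                    out.append((p, q))
    return sorted(out)


text = "ACGTAGGCAACGTTTGCA"  # made-up sequence
sa = build_suffix_array(text)
pos_acg = occurrences(text, sa, "ACG")
pos_gca = occurrences(text, sa, "GCA")
print(pos_acg, pos_gca)
print(naive_pairs(pos_acg, pos_gca, max_dist=8))
print(blocked_pairs(pos_acg, pos_gca, max_dist=8, block=4))
```

Both functions return the same pairs; the bucketed version simply skips word-2 occurrences in blocks that are too far away to match. The abstract describes a finer scheme in which the positional resolution is refined progressively, which this sketch does not attempt.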

Genome informatics. International Conference on Genome Informatics, vol. 23, no. 1, pp. 60-71. Pub Date : 2009-10-01