IPSJ Transactions on Bioinformatics: Latest Articles

Sparse Learner Boosting for Gene Expression Data

Q3 Biochemistry, Genetics and Molecular Biology

IPSJ Transactions on Bioinformatics

Pub Date : 2010-01-01 DOI: 10.2197/IPSJTBIO.3.54

M. Pritchard

Gene expression analysis is commonly used to analyze millions of gene expression data points. A key challenge in this process has been the development of appropriate statistical methods for high-dimensional data. We propose Sparse Learner Boosting for gene expression data analysis. Boosting is performed to minimize the loss function, although this process can cause overfitting when a large number of variables are present. Ordinary boosting utilizes all of the potential weak learners in a given data set and constructs a decision rule. The fundamental idea of Sparse Learner Boosting is to reduce the complexity of the decision rule by using fewer weak learners than is usually required. This reduction prevents overfitting and improves performance during classification. Numerical studies support this modification for high-dimensional data, such as that obtained from gene expression analysis. We show that the proposed modification improves the performance of ordinary boosting methods.
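
The idea of limiting how many weak learners enter the decision rule can be illustrated with a small sketch. The abstract does not specify the selection procedure, so the code below is only an assumption-laden illustration: plain AdaBoost over decision stumps, with a hypothetical `max_features` parameter that caps how many distinct variables (one stump family per variable) the ensemble may draw on. All function and parameter names are invented for this example and are not taken from the paper.

```python
import math

def stump_predict(x, feature, threshold, sign):
    """Decision stump: returns sign if x[feature] > threshold, otherwise -sign."""
    return sign if x[feature] > threshold else -sign

def fit_boosting(X, y, n_rounds, max_features=None):
    """AdaBoost over stumps; max_features (an assumption) caps distinct variables used."""
    n = len(X)
    w = [1.0 / n] * n          # example weights
    ensemble = []              # entries: (alpha, feature, threshold, sign)
    used = set()               # variables already present in the decision rule
    for _ in range(n_rounds):
        if max_features is not None and len(used) >= max_features:
            candidates = sorted(used)          # cap reached: reuse known variables only
        else:
            candidates = range(len(X[0]))
        best = None
        for f in candidates:
            for t in sorted({row[f] for row in X}):
                for s in (1, -1):
                    err = sum(wi for wi, xi, yi in zip(w, X, y)
                              if stump_predict(xi, f, t, s) != yi)
                    if best is None or err < best[0]:
                        best = (err, f, t, s)
        if best is None:
            break
        err, f, t, s = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        if err >= 0.5:
            break                              # no stump beats chance
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, t, s))
        used.add(f)
        # Increase the weight of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, f, t, s))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the weighted vote of all stumps in the ensemble."""
    score = sum(a * stump_predict(x, f, t, s) for a, f, t, s in ensemble)
    return 1 if score >= 0 else -1

# Toy data (made up): the label depends only on the first of three variables.
X = [[0.1, 5, 2], [0.2, 1, 7], [0.8, 4, 3], [0.9, 2, 6]]
y = [-1, -1, 1, 1]
model = fit_boosting(X, y, n_rounds=5, max_features=1)
```

With `max_features=None` the sketch reduces to ordinary AdaBoost that may pick a stump on any variable in every round; the cap is one simple way to keep the decision rule small when there are far more variables than samples.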

Citations: 1

Support vector machine prediction of N- and O-glycosylation sites using whole sequence information and subcellular localization

Q3 Biochemistry, Genetics and Molecular Biology

IPSJ Transactions on Bioinformatics

Pub Date : 2009-12-01 DOI: 10.2197/IPSJTBIO.2.25

Kenta Sasaki, Nobuyoshi Nagamine, Y. Sakakibara

Background: Glycans, or sugar chains, are one of the three types of chain (DNA, protein and glycan) that constitute living organisms; they are often called “the third chain of the living organism”. About half of all proteins are estimated to be glycosylated based on the SWISS-PROT database. Glycosylation is one of the most important post-translational modifications, affecting many critical functions of proteins, including cellular communication, and their tertiary structure. In order to computationally predict N-glycosylation and O-glycosylation sites, we developed three kinds of support vector machine (SVM) models, which utilize local information, general protein information and/or subcellular localization in consideration of the binding specificity of glycosyltransferases and the characteristic subcellular localization of glycoproteins. Results: In our computational experiment, the model integrating three kinds of information achieved about 90% accuracy in predictions of both N-glycosylation and O-glycosylation sites. Moreover, our model was applied to a protein whose glycosylation sites had not been previously identified and we succeeded in showing that the glycosylation sites predicted by our model were structurally reasonable. Conclusions: In the present study, we developed a comprehensive and effective computational method that detects glycosylation sites. We conclude that our method is a comprehensive and effective computational prediction method that is applicable at a genome-wide level.

Citations: 16

A Modified Algorithm for Sequence Alignment Using Ant Colony System

Q3 Biochemistry, Genetics and Molecular Biology

IPSJ Transactions on Bioinformatics

Pub Date : 2009-12-01 DOI: 10.2197/IPSJTBIO.2.63

A. Mikami, Jianming Shi

In this study, we use the Ant Colony System (ACS) to develop a heuristic algorithm for sequence alignment. This algorithm is an improvement on ACS-MultiAlignment, which was proposed in 2005 for predicting major histocompatibility complex (MHC) class II binders. The numerical experiments indicate that this algorithm is as much as 2,900 times faster than the original ACS-MultiAlignment algorithm. We also compare this algorithm with other approaches, such as the Gibbs sampling algorithm, using numerical experiments. The results show that our algorithm finds the best value more quickly than the Gibbs approach.

Citations: 3

Q3 Biochemistry, Genetics and Molecular Biology

IPSJ Transactions on Bioinformatics

Pub Date : 2009-03-24 DOI: 10.2197/IPSJTBIO.2.15

Y. Tohsato, Yuki Nishimura

Comparative analyses of enzymatic reactions provide important information on both evolution and potential pharmacological targets. Previously, we focused on the structural formulae of compounds, and proposed a method to calculate enzymatic similarities based on these formulae. However, with the proposed method it is difficult to measure the reaction similarity when the formulae of the compounds constituting each reaction are completely different. The present study was performed to extract substructures that change within chemical compounds using the RPAIR data in KEGG. Two approaches were applied to measure the similarity between the extracted substructures: a fingerprint-based approach using the MACCS key and the Tanimoto/Jaccard coefficients; and the Topological Fragment Spectra-based approach that does not require any predefined list of substructures. Whether the similarity measures can detect similarity between enzymatic reactions was evaluated. Using one of the similarity measures, metabolic pathways in Escherichia coli were aligned to confirm the effectiveness of the method.
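
As a brief illustration of the fingerprint-based measure mentioned above, the Tanimoto/Jaccard coefficient of two bit fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below represents each fingerprint as the set of its "on" bit positions. The bit indices are made up for the example and do not correspond to real MACCS keys, and treating two empty fingerprints as identical is a convention assumed here.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto/Jaccard coefficient of two fingerprints given as sets of 'on' bit indices."""
    if not a and not b:
        return 1.0  # assumed convention: two empty fingerprints count as identical
    return len(a & b) / len(a | b)

# Hypothetical fingerprints of two extracted substructures.
fp_x = {3, 17, 42, 88}
fp_y = {3, 42, 88, 120, 151}

similarity = tanimoto(fp_x, fp_y)  # 3 shared bits out of 6 distinct bits
print(similarity)
```

The coefficient ranges from 0 (no shared bits) to 1 (identical fingerprints), which makes it convenient as a similarity score between extracted substructures.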

Citations: 4

Selection of Effective Sentences from a Corpus to Improve the Accuracy of Identification of Protein Names

Q3 Biochemistry, Genetics and Molecular Biology

IPSJ Transactions on Bioinformatics

Pub Date : 2009-01-01 DOI: 10.2197/IPSJTBIO.2.93

Kazunori Miyanishi, Tomonobu Ozaki, T. Ohkawa

As the number of documents about protein structural analysis increases, a method of automatically identifying protein names in them is required. However, the accuracy of identification is not high if the training data set is not large enough. We consider a method to extend a training data set based on machine learning using an available corpus. Such a corpus usually consists of documents about a certain kind of organism species, and documents about different kinds of organism species tend to have different vocabularies. Therefore, depending on the target document or corpus, simply using a corpus as a training data set is not effective for accurate identification. In order to improve the accuracy, we propose a method to select sentences that have a positive effect on identification and to extend the training data set with the selected sentences. In the proposed method, a portion of a set of tagged sentences is used as a validation set. The process to select sentences is iterated using the result of the identification of protein names in a validation set as feedback. In the experiment, the accuracy of the proposed method was higher than that of every baseline: a method without a corpus, with the whole corpus, or with a randomly chosen part of the corpus. Thus, it was confirmed that the proposed method selected effective sentences.

Citations: 1

Nonmetric Distances for Barcode of Life

Q3 Biochemistry, Genetics and Molecular Biology

IPSJ Transactions on Bioinformatics

Pub Date : 2008-01-01 DOI: 10.2197/IPSJTBIO.1.35

H. Akiba, Y-h. Taguchi

The Barcode of Life (BOL) project[4] aims to make species recognition easier. Although it is often troublesome to define what a species is, BOL can define species by simple DNA sequences. When it works, we need no information other than DNA sequences to decide whether two individuals belong to the same species. If they share the same BOL, they undoubtedly belong to the same species. In contrast, it is usually difficult to define higher clades. We cannot expect all individuals that belong to the same higher clade to share the same BOL. Instead, we have to find how the BOLs of individuals that belong to distinct higher clades differ from each other. In this poster, we demonstrate how a nonmetric measure of distances between BOLs makes it easier to recognize whether individuals belong to a common higher clade. We also show that the usual hierarchical clustering, such as the NJ method, is not suitable for visualizing relationships expressed by a nonmetric measure, and propose the use of nonmetric multidimensional scaling (nMDS)[1, 2].

Citations: 0

A Linear Time Algorithm that Infers Hidden Strings from Their Concatenations

Q3 Biochemistry, Genetics and Molecular Biology

IPSJ Transactions on Bioinformatics

Pub Date : 2008-01-01 DOI: 10.2197/IPSJTBIO.1.13

Tomohiro Yasuda

Let T be a set of hidden strings and S be a set of their concatenations. We address the problem of inferring T from S. Any formalization of the problem as an optimization problem would be computationally hard, because it is NP-complete even to determine whether there exists T smaller than S, and because it is also NP-complete to partition only two strings into the smallest common collection of substrings. In this paper, we devise a new algorithm that infers T by finding common substrings in S and splitting them. This algorithm is scalable and can be completed in O(L)-time regardless of the cardinality of S, where L is the sum of the lengths of all strings in S. In computational experiments, 40,000 random concatenations of randomly generated strings were successfully decomposed, and the effectiveness of our method on this problem was compared with that of multiple sequence alignment programs. We also present the result of a preliminary experiment against the transcriptome of Homo sapiens and describe problems in applications where real large-scale cDNA sequences are analyzed.
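
To make the problem statement concrete, the following sketch checks whether a candidate set T explains a set S, that is, whether every string in S can be written as a concatenation of strings from T. It is only a verification helper built on a standard dynamic program. It is not the linear-time inference algorithm described in the abstract, and the function names and example strings are invented for illustration.

```python
def is_concatenation(s: str, parts: set) -> bool:
    """True if s can be written as a concatenation of one or more strings from parts."""
    # reachable[i] is True when the prefix s[:i] can be tiled by strings from parts.
    reachable = [False] * (len(s) + 1)
    reachable[0] = True
    for end in range(1, len(s) + 1):
        reachable[end] = any(
            reachable[end - len(p)] and s[end - len(p):end] == p
            for p in parts
            if 0 < len(p) <= end
        )
    return len(s) > 0 and reachable[len(s)]

def explains(T: set, S: list) -> bool:
    """True if every string in S is a concatenation of strings from T."""
    return all(is_concatenation(s, T) for s in S)

# Hypothetical hidden strings and observed concatenations.
T = {"ACG", "TT", "GA"}
S = ["ACGTT", "TTGAACG", "GATT"]
ok = explains(T, S)
```

Checking a given T is easy; the hard part addressed by the paper is finding a small T when only S is observed.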

Citations: 0
