Gene expression analysis is commonly used to analyze millions of gene expression data points. Challenging in this process has been the development of appropriate statistical methods for high-dimensional data. We propose Sparse Learner Boosting for gene expression data analysis. Boosting is performed to minimize the loss function, although this process can cause overfitting when a large number of variables are present. Ordinary boosting utilizes all of the potential weak learners in a given data set and constructs a decision rule. The fundamental idea of Sparse Learner Boosting is to reduce the complexity of the decision rule by using fewer weak learners than is usually required. This reduction prevents overfitting and improves performance during classification. Numerical studies support this modification for high-dimensional data, such as that obtained from gene expression analysis. We show that the proposed modification improves the performance of ordinary boosting methods.
{"title":"Sparse Learner Boosting for Gene Expression Data","authors":"M. Pritchard","doi":"10.2197/IPSJTBIO.3.54","DOIUrl":"https://doi.org/10.2197/IPSJTBIO.3.54","url":null,"abstract":"Gene expression analysis is commonly used to analyze millions of gene expression data points. Challenging in this process has been the development of appropriate statistical methods for high-dimensional data. We propose Sparse Learner Boosting for gene expression data analysis. Boosting is performed to minimize the loss function, although this process can cause overfitting when a large number of variables are present. Ordinary boosting utilizes all of the potential weak learners in a given data set and constructs a decision rule. The fundamental idea of Sparse Learner Boosting is to reduce the complexity of the decision rule by using fewer weak learners than is usually required. This reduction prevents overfitting and improves performance during classification. Numerical studies support this modification for high-dimensional data, such as that obtained from gene expression analysis. We show that the proposed modification improves the performance of ordinary boosting methods.","PeriodicalId":38959,"journal":{"name":"IPSJ Transactions on Bioinformatics","volume":"3 1","pages":"54-61"},"PeriodicalIF":0.0,"publicationDate":"2010-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2197/IPSJTBIO.3.54","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68502206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Glycans, or sugar chains, are one of the three types of chain (DNA, protein and glycan) that constitute living organisms; they are often called “the third chain of the living organism”. About half of all proteins are estimated to be glycosylated based on the SWISS-PROT database. Glycosylation is one of the most important post-translational modifications, affecting many critical functions of proteins, including cellular communication, and their tertiary structure. In order to computationally predict N-glycosylation and O-glycosylation sites, we developed three kinds of support vector machine (SVM) model, which utilize local information, general protein information and/or subcellular localization in consideration of the binding specificity of glycosyltransferases and the characteristic subcellular localization of glycoproteins. Results: In our computational experiment, the model integrating three kinds of information achieved about 90% accuracy in predictions of both N-glycosylation and O-glycosylation sites. Moreover, our model was applied to a protein whose glycosylation sites had not been previously identified and we succeeded in showing that the glycosylation sites predicted by our model were structurally reasonable. Conclusions: In the present study, we developed a comprehensive and effective computational method that detects glycosylation sites. We conclude that our method is a comprehensive and effective computational prediction method that is applicable at a genome-wide level.
{"title":"Support vector machine prediction of N-and O-glycosylation sites using whole sequence information and subcellular localization","authors":"Kenta Sasaki, Nobuyoshi Nagamine, Y. Sakakibara","doi":"10.2197/IPSJTBIO.2.25","DOIUrl":"https://doi.org/10.2197/IPSJTBIO.2.25","url":null,"abstract":"Background: Glycans, or sugar chains, are one of the three types of chain (DNA, protein and glycan) that constitute living organisms; they are often called “the third chain of the living organism”. About half of all proteins are estimated to be glycosylated based on the SWISS-PROT database. Glycosylation is one of the most important post-translational modifications, affecting many critical functions of proteins, including cellular communication, and their tertiary structure. In order to computationally predict N-glycosylation and O-glycosylation sites, we developed three kinds of support vector machine (SVM) model, which utilize local information, general protein information and/or subcellular localization in consideration of the binding specificity of glycosyltransferases and the characteristic subcellular localization of glycoproteins. Results: In our computational experiment, the model integrating three kinds of information achieved about 90% accuracy in predictions of both N-glycosylation and O-glycosylation sites. Moreover, our model was applied to a protein whose glycosylation sites had not been previously identified and we succeeded in showing that the glycosylation sites predicted by our model were structurally reasonable. Conclusions: In the present study, we developed a comprehensive and effective computational method that detects glycosylation sites. We conclude that our method is a comprehensive and effective computational prediction method that is applicable at a genome-wide level.","PeriodicalId":38959,"journal":{"name":"IPSJ Transactions on Bioinformatics","volume":"2 1","pages":"25-35"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2197/IPSJTBIO.2.25","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68502318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this study, we use the Ant Colony System (ACS) to develop a heuristic algorithm for sequence alignment. This algorithm is certainly an improvement on ACS-MultiAlignment, which was proposed in 2005 for predicting major histocompatibility complex (MHC) class II binders. The numerical experiments indicate that this algorithm is as much as 2, 900 times faster than the original ACS-MultiAlignment algorithm. We also compare this algorithm to the other approaches such as Gibbs sampling algorithm using numerical experiments. The results show that our algorithm finds the best value prompter than Gibbs approach.
在这项研究中,我们使用蚁群系统(ACS)开发了一种启发式序列比对算法。该算法无疑是ACS-MultiAlignment的改进,ACS-MultiAlignment于2005年提出,用于预测主要组织相容性复合体(MHC) II类结合物。数值实验表明,该算法比原acs - multi - alignment算法快2900倍。并通过数值实验将该算法与Gibbs抽样算法等其他方法进行了比较。结果表明,该算法比Gibbs方法找到了最优的价值提示符。
{"title":"A Modified Algorithm for Sequence Alignment Using Ant Colony System","authors":"A. Mikami, Jianming Shi","doi":"10.2197/IPSJTBIO.2.63","DOIUrl":"https://doi.org/10.2197/IPSJTBIO.2.63","url":null,"abstract":"In this study, we use the Ant Colony System (ACS) to develop a heuristic algorithm for sequence alignment. This algorithm is certainly an improvement on ACS-MultiAlignment, which was proposed in 2005 for predicting major histocompatibility complex (MHC) class II binders. The numerical experiments indicate that this algorithm is as much as 2, 900 times faster than the original ACS-MultiAlignment algorithm. We also compare this algorithm to the other approaches such as Gibbs sampling algorithm using numerical experiments. The results show that our algorithm finds the best value prompter than Gibbs approach.","PeriodicalId":38959,"journal":{"name":"IPSJ Transactions on Bioinformatics","volume":"2 1","pages":"63-73"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2197/IPSJTBIO.2.63","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68501961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparative analyses of enzymatic reactions provide important information on both evolution and potential pharmacological targets. Previously, we focused on the structural formulae of compounds, and proposed a method to calculate enzymatic similarities based on these formulae. However, with the proposed method it is difficult to measure the reaction similarity when the formulae of the compounds constituting each reaction are completely different. The present study was performed to extract substructures that change within chemical compounds using the RPAIR data in KEGG. Two approaches were applied to measure the similarity between the extracted substructures: a fingerprint-based approach using the MACCS key and the Tanimoto/Jaccard coefficients; and the Topological Fragment Spectra-based approach that does not require any predefined list of substructures. Whether the similarity measures can detect similarity between enzymatic reactions was evaluated. Using one of the similarity measures, metabolic pathways in Escherichia coli were aligned to confirm the effectiveness of the method.
{"title":"Reaction Similarities Focusing Substructure Changes of Chemical Compounds and Metabolic Pathway Alignments","authors":"Y. Tohsato, Yuki Nishimura","doi":"10.2197/IPSJTBIO.2.15","DOIUrl":"https://doi.org/10.2197/IPSJTBIO.2.15","url":null,"abstract":"Comparative analyses of enzymatic reactions provide important information on both evolution and potential pharmacological targets. Previously, we focused on the structural formulae of compounds, and proposed a method to calculate enzymatic similarities based on these formulae. However, with the proposed method it is difficult to measure the reaction similarity when the formulae of the compounds constituting each reaction are completely different. The present study was performed to extract substructures that change within chemical compounds using the RPAIR data in KEGG. Two approaches were applied to measure the similarity between the extracted substructures: a fingerprint-based approach using the MACCS key and the Tanimoto/Jaccard coefficients; and the Topological Fragment Spectra-based approach that does not require any predefined list of substructures. Whether the similarity measures can detect similarity between enzymatic reactions was evaluated. Using one of the similarity measures, metabolic pathways in Escherichia coli were aligned to confirm the effectiveness of the method.","PeriodicalId":38959,"journal":{"name":"IPSJ Transactions on Bioinformatics","volume":"7 1","pages":"15-24"},"PeriodicalIF":0.0,"publicationDate":"2009-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2197/IPSJTBIO.2.15","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68502257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As the number of documents about protein structural analysis increases, a method of automatically identifying protein names in them is required. However, the accuracy of identification is not high if the training data set is not large enough. We consider a method to extend a training data set based on machine learning using an available corpus. Such a corpus usually consists of documents about a certain kind of organism species, and documents about different kinds of organism species tend to have different vocabularies. Therefore, depending on the target document or corpus, it is not effective for the accurate identification to simply use a corpus as a training data set. In order to improve the accuracy, we propose a method to select sentences that have a positive effect on identification and to extend the training data set with the selected sentences. In the proposed method, a portion of a set of tagged sentences is used as a validation set. The process to select sentences is iterated using the result of the identification of protein names in a validation set as feedback. In the experiment, compared with the baseline, a method without a corpus, with a whole corpus, or with a part of a corpus chosen at random, the accuracy of the proposed method was higher than any baseline method. Thus, it was confirmed that the proposed method selected effective sentences.
{"title":"Selection of Effective Sentences from a Corpus to Improve the Accuracy of Identification of Protein Names","authors":"Kazunori Miyanishi, Tomonobu Ozaki, T. Ohkawa","doi":"10.2197/IPSJTBIO.2.93","DOIUrl":"https://doi.org/10.2197/IPSJTBIO.2.93","url":null,"abstract":"As the number of documents about protein structural analysis increases, a method of automatically identifying protein names in them is required. However, the accuracy of identification is not high if the training data set is not large enough. We consider a method to extend a training data set based on machine learning using an available corpus. Such a corpus usually consists of documents about a certain kind of organism species, and documents about different kinds of organism species tend to have different vocabularies. Therefore, depending on the target document or corpus, it is not effective for the accurate identification to simply use a corpus as a training data set. In order to improve the accuracy, we propose a method to select sentences that have a positive effect on identification and to extend the training data set with the selected sentences. In the proposed method, a portion of a set of tagged sentences is used as a validation set. The process to select sentences is iterated using the result of the identification of protein names in a validation set as feedback. In the experiment, compared with the baseline, a method without a corpus, with a whole corpus, or with a part of a corpus chosen at random, the accuracy of the proposed method was higher than any baseline method. Thus, it was confirmed that the proposed method selected effective sentences.","PeriodicalId":38959,"journal":{"name":"IPSJ Transactions on Bioinformatics","volume":"2 1","pages":"93-100"},"PeriodicalIF":0.0,"publicationDate":"2009-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2197/IPSJTBIO.2.93","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68502015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Barcode of Life (BOL) project[4] is the project to enable us to recognize species easier. Although it is often troublesome to define what the species are, BOL can define species by simple DNA sequences. When it works, we do not have to consult with any other information than DNA sequences to decide if two individuals belong to the same species or not. If they share same BOL with each other, they belong to the same species undoubtedly. In contrast to this, it is usually difficult to define what the higher clade are. We cannot expect that each individual which belong to the same upper Claude share the same BOL. Instead, we have to find how BOL of individuals which belong to distinct higher clade differ from each other. In this poster, we demonstrate how nonmetric measure of distances between BOL make easier to recognize if each belongs to common higher clade or not. We also show that usual hierarchical clustering like NJ method is not suitable to visualize relationships expressed by nonmetric measure and propose to usage of nonmetric multidimensional scaling (nMDS)[1, 2].
{"title":"Nonmetric Distances for Barcode of Life","authors":"H. Akiba, Y-h. Taguchi","doi":"10.2197/IPSJTBIO.1.35","DOIUrl":"https://doi.org/10.2197/IPSJTBIO.1.35","url":null,"abstract":"Barcode of Life (BOL) project[4] is the project to enable us to recognize species easier. Although it is often troublesome to define what the species are, BOL can define species by simple DNA sequences. When it works, we do not have to consult with any other information than DNA sequences to decide if two individuals belong to the same species or not. If they share same BOL with each other, they belong to the same species undoubtedly. In contrast to this, it is usually difficult to define what the higher clade are. We cannot expect that each individual which belong to the same upper Claude share the same BOL. Instead, we have to find how BOL of individuals which belong to distinct higher clade differ from each other. In this poster, we demonstrate how nonmetric measure of distances between BOL make easier to recognize if each belongs to common higher clade or not. We also show that usual hierarchical clustering like NJ method is not suitable to visualize relationships expressed by nonmetric measure and propose to usage of nonmetric multidimensional scaling (nMDS)[1, 2].","PeriodicalId":38959,"journal":{"name":"IPSJ Transactions on Bioinformatics","volume":"1 1","pages":"35-41"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2197/IPSJTBIO.1.35","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68500635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Let T be a set of hidden strings and S be a set of their concatenations. We address the problem of inferring T from S. Any formalization of the problem as an optimization problem would be computationally hard, because it is NP-complete even to determine whether there exists T smaller than S, and because it is also NP-complete to partition only two strings into the smallest common collection of substrings. In this paper, we devise a new algorithm that infers T by finding common substrings in S and splitting them. This algorithm is scalable and can be completed in O(L)-time regardless of the cardinality of S, where L is the sum of the lengths of all strings in S. In computational experiments, 40, 000 random concatenations of randomly generated strings were successfully decomposed, as well as the effectiveness of our method for this problem was compared with that of multiple sequence alignment programs. We also present the result of a preliminary experiment against the transcriptome of Homo sapiens and describe problems in applications where real large-scale cDNA sequences are analyzed.
{"title":"A Linear Time Algorithm that Infers Hidden Strings from Their Concatenations","authors":"Tomohiro Yasuda","doi":"10.2197/IPSJTBIO.1.13","DOIUrl":"https://doi.org/10.2197/IPSJTBIO.1.13","url":null,"abstract":"Let T be a set of hidden strings and S be a set of their concatenations. We address the problem of inferring T from S. Any formalization of the problem as an optimization problem would be computationally hard, because it is NP-complete even to determine whether there exists T smaller than S, and because it is also NP-complete to partition only two strings into the smallest common collection of substrings. In this paper, we devise a new algorithm that infers T by finding common substrings in S and splitting them. This algorithm is scalable and can be completed in O(L)-time regardless of the cardinality of S, where L is the sum of the lengths of all strings in S. In computational experiments, 40, 000 random concatenations of randomly generated strings were successfully decomposed, as well as the effectiveness of our method for this problem was compared with that of multiple sequence alignment programs. We also present the result of a preliminary experiment against the transcriptome of Homo sapiens and describe problems in applications where real large-scale cDNA sequences are analyzed.","PeriodicalId":38959,"journal":{"name":"IPSJ Transactions on Bioinformatics","volume":"1 1","pages":"13-22"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2197/IPSJTBIO.1.13","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68500499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}