Younghoon Kim, Doheon Lee, Yongseong Cho, Sang Joo Lee
Motivation: Although gene expression data has been continuously accumulated and meta-analysis approaches have been developed to integrate independent expression profiles into larger datasets, the amount of information is still insufficient to infer large-scale genetic networks. In addition, global optimization such as Bayesian network inference, one of the most representative techniques for genetic network inference, imposes a computational load far beyond the capacity of moderate workstations.
Results: MONET is a Cytoscape plugin for inferring genome-scale networks from gene expression profiles. It alleviates the shortage of information by incorporating pre-existing annotations. The current version of MONET uses thousands of parallel computational cores at the KISTI supercomputing center in Korea to cope with the computational requirements of large-scale genetic network inference.
Availability: A Cytoscape plugin is available at http://cytoscape.org and a web service is at http://delsol.kaist.ac.kr/~monet/home
{"title":"A large-scale gene network inference system for systems biology on supercomputing resources","authors":"Younghoon Kim, Doheon Lee, Yongseong Cho, Sang Joo Lee","doi":"10.1145/1651318.1651340","DOIUrl":"https://doi.org/10.1145/1651318.1651340","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2009-11-06","publicationTypes":"Journal Article"}
Ikumi Suzuki, Kazuo Hara, M. Shimbo, Yuji Matsumoto
The addition of new terms to biomedical thesauri is important for keeping pace with new research. In the context of a thesaurus expansion task, we investigate the property of Laplacian diffusion kernel matrices that depreciates pivotal vertices having many links to surrounding vertices. We confirm that this property can be seen on the Laplacian matrix of a graph that we construct from the GENIA corpus (a subset of MEDLINE abstracts), and we simulate thesaurus expansion by employing either the Laplacian diffusion kernel matrix or the adjacency matrix (i.e., cosine similarity) to determine the correct position for new biomedical terms being added to the MeSH thesaurus. Although the results do not reach the desired precision, our approach is shown to be complementary to the calculation of cosine similarity between thesaurus terms, and we identify directions for future work.
{"title":"A graph-based approach for biomedical thesaurus expansion","authors":"Ikumi Suzuki, Kazuo Hara, M. Shimbo, Yuji Matsumoto","doi":"10.1145/1651318.1651336","DOIUrl":"https://doi.org/10.1145/1651318.1651336","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2009-11-06","publicationTypes":"Journal Article"}
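The hub-discounting property described in the thesaurus expansion abstract above can be illustrated on a toy graph. The following is a minimal sketch, not the authors' implementation: the seven-vertex graph, the value of `beta`, the helper names, and the truncated Taylor series used to approximate the matrix exponential are all assumptions made for illustration. It computes the diffusion kernel K = exp(-beta * L), where L = D - A is the graph Laplacian, so that it can be compared with the raw adjacency matrix A.

```python
# Toy graph (hypothetical): vertex 0 is a hub linked to vertices 1..5,
# and vertex 1 is also linked to a leaf, vertex 6.
EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 6)]
NUM_VERTICES = 7


def build_adjacency(edges, n):
    """Return a symmetric 0/1 adjacency matrix as nested lists."""
    adj = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = 1.0
        adj[v][u] = 1.0
    return adj


def graph_laplacian(adj):
    """Return L = D - A, where D holds the vertex degrees on its diagonal."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def mat_mul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_kernel(adj, beta=0.1, terms=30):
    """Approximate K = exp(-beta * L) by summing the first `terms` Taylor terms."""
    n = len(adj)
    lap = graph_laplacian(adj)
    step = [[-beta * lap[i][j] for j in range(n)] for i in range(n)]
    kernel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in kernel]  # holds (-beta * L)^k / k!
    for k in range(1, terms + 1):
        term = [[value / k for value in row] for row in mat_mul(term, step)]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return kernel


if __name__ == "__main__":
    adjacency = build_adjacency(EDGES, NUM_VERTICES)
    kernel = diffusion_kernel(adjacency, beta=0.1)
    print("adjacency: 1-0 =", adjacency[1][0], " 1-6 =", adjacency[1][6])
    print("kernel:    1-0 =", kernel[1][0], " 1-6 =", kernel[1][6])
```

In the adjacency matrix, vertex 1 is equally similar to the hub (vertex 0) and to the leaf (vertex 6). In the kernel, the link to the leaf scores higher than the link to the hub, which is the depreciation of highly connected vertices that the abstract refers to. A real application would use a weighted graph and a library routine for the matrix exponential.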
Human Endogenous RetroViruses (HERVs) have been suggested to regulate the activity of human genes and may produce proteins under some conditions. It is therefore crucial to examine the physical layout relationship between HERVs and genes at the whole-genome scale. In this paper we present RetroScope, a new Web-based comparative visualization system for HERVs over four whole primate genomes: human, chimpanzee, orangutan and rhesus monkey. RetroScope enables us to find retro elements that lie very close to a specified gene, in the form of exon overlap or promoter or primer overlap. Our system thus helps biologists gain a global understanding by comparing the linear configuration of several HERVs at whole-chromosome scale using a fast HERV alignment algorithm. By aligning HERVs, we can also find the most similar pair of chromosomes with respect to the configuration of HERV elements, which provides another clue for constructing HERV-based phylogenies. RetroScope is available at http://neobio.cs.pusan.ac.kr/sretroscope/.
{"title":"A web-based comparative visualization system for human endogenous RetroVirus(HERV) on whole genomes","authors":"Woo-Keun Chung, Hyong-Jun Kim, Hwan-Gue Cho","doi":"10.1145/1651318.1651333","DOIUrl":"https://doi.org/10.1145/1651318.1651333","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2009-11-06","publicationTypes":"Journal Article"}
This project assembles a virtual team consisting of personnel from the New Jersey Institute of Technology, with expertise in the data mining domain, and the Saint Barnabas Health Care System, with expertise in the medical domain. We apply proven techniques in data and text mining to the problem of hospital mortality. Methodology in outcomes research using data/text mining has typically used Bayesian networks (including decision trees and rules), regression analysis, or neural networks/support vector machines to analyze a single disease or condition. We propose instead to analyze the entire spectrum of reasons patients are admitted to a hospital, in an effort to discern which chronologies result in good outcomes and which result in the worst outcome, so as to identify the characteristics to be avoided throughout the spectrum of reasons for admission.
{"title":"An outcome discovery system to determine mortality factors in primary care facilities","authors":"Jeremias Murillo, Min Song","doi":"10.1145/1651318.1651341","DOIUrl":"https://doi.org/10.1145/1651318.1651341","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2009-11-06","publicationTypes":"Journal Article"}
In this talk I will discuss some data mining techniques and methods in the bioinformatics domain, along with the main challenges and opportunities. I will cover some of the issues related to biomedical literature mining, bioinformatics data integration, and biological network analysis and simulation. In biomedical literature mining, I will discuss effective information retrieval and large-scale information extraction from the biomedical literature. I will also share my view of the semantic-based approach to data integration in the bioinformatics domain. Finally, I will talk about the various approaches to biological network analysis and simulation.
{"title":"Data mining in bioinformatics: challenges and opportunities","authors":"Xiaohua Hu","doi":"10.1145/1651318.1651320","DOIUrl":"https://doi.org/10.1145/1651318.1651320","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2009-11-06","publicationTypes":"Journal Article"}
The microarray is gaining popularity in biomedical research due to its ability to analyze hundreds to thousands of genes simultaneously in one experiment. However, the unique nature of microarray data, with a large number of features but a relatively small number of samples, makes it challenging to process effectively. The curse of dimensionality makes feature extraction important when analyzing microarray data. We therefore propose a novel incremental method to discover the non-Gaussian weight from microarray gene expression data with high efficiency. Our proposed method can discover a small number of compact features from a huge number of genes and still achieve good predictive performance. It integrates non-Gaussianity and an adaptive incremental model in an unsupervised way to extract informative features. It is also feasible to analyze microarray data in which the number of features is much larger than the number of observations, with promising results.
{"title":"Incremental non-gaussian analysis of microarray gene expression data","authors":"Kam Swee Ng, Hyung-Jeong Yang, Sun-Hee Kim","doi":"10.1145/1651318.1651334","DOIUrl":"https://doi.org/10.1145/1651318.1651334","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2009-11-06","publicationTypes":"Journal Article"}
Analysis of the robustness of a metabolic network against deletion of single or multiple reactions is useful for mining important enzymes/genes. For that purpose, the impact degree was proposed by Jiang et al. In this short paper, we extend the impact degree to metabolic networks containing cycles and develop a simple algorithm for its computation. Furthermore, we propose an improved algorithm for computing impact degrees for deletions of multiple reactions. The results of preliminary computational experiments suggest that the improved algorithm is several tens of times faster than the simple algorithm.
{"title":"Efficient computation of impact degrees for multiple reactions in metabolic networks with cycles","authors":"Yang Cong, Takeyuki Tamura, T. Akutsu, W. Ching","doi":"10.1145/1651318.1651332","DOIUrl":"https://doi.org/10.1145/1651318.1651332","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2009-10-01","publicationTypes":"Journal Article"}
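As a rough illustration of the impact degree discussed in the abstract above, the sketch below counts how many reactions become inactive after deleting one or more reactions from a small Boolean network that contains a cycle. Everything here is an assumption made for illustration: the toy network, the function names, the rule that a reaction is active only if all of its substrates are available, the convention of starting with every non-deleted reaction active and deactivating until nothing changes, and the choice to count the deleted reactions themselves. It corresponds to a brute-force baseline over all deletion sets, not to the improved algorithm the authors propose.

```python
from itertools import combinations

# Hypothetical toy network: reaction name -> (substrates, products).
# R2, R3 and R4 form the cycle B -> C -> D -> B.
REACTIONS = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"B"}, {"C"}),
    "R3": ({"C"}, {"D"}),
    "R4": ({"D"}, {"B"}),
    "R5": ({"A"}, {"E"}),
}
SOURCES = {"A"}  # compounds assumed to be supplied from outside the network


def inactive_after_deletion(reactions, sources, deleted):
    """Return the set of reactions that are inactive once `deleted` is removed.

    Starts with every non-deleted reaction active, then repeatedly switches
    off any reaction with an unavailable substrate until a fixed point.
    """
    active = set(reactions) - set(deleted)
    changed = True
    while changed:
        changed = False
        # A compound is available if it is a source or some active reaction produces it.
        available = set(sources)
        for name in active:
            available |= reactions[name][1]
        for name in sorted(active):
            if not reactions[name][0] <= available:
                active.remove(name)
                changed = True
    return set(reactions) - active


def impact_degree(reactions, sources, deleted):
    """Number of inactive reactions, counting the deleted ones (an assumption)."""
    return len(inactive_after_deletion(reactions, sources, deleted))


def impact_degrees_for_sets(reactions, sources, size):
    """Brute force: impact degree for every deletion set of the given size."""
    return {
        combo: impact_degree(reactions, sources, combo)
        for combo in combinations(sorted(reactions), size)
    }


if __name__ == "__main__":
    for name in sorted(REACTIONS):
        print(name, impact_degree(REACTIONS, SOURCES, {name}))
    print(impact_degrees_for_sets(REACTIONS, SOURCES, 2))
```

With this toy network, deleting R2 also silences R3 and R4 because nothing else produces C, whereas deleting R1 leaves the cycle self-sustaining under the start-fully-active convention. The brute-force enumeration grows combinatorially with the size of the deletion set, which is the cost that a faster method for multiple reactions would aim to reduce.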
We are building the Pharmacogenetics & Pharmacogenomics Knowledgebase (PharmGKB, http://www.pharmgkb.org/) with the goal of cataloguing all knowledge about how genetic variation impacts drug response phenotypes. PharmGKB stores primary data (genotype and phenotype data) as well as more distilled knowledge in the form of pathway diagrams, annotated summaries of very important pharmacogenes (VIP genes), and annotated literature. The literature annotation efforts include both manual curation by trained curators and automatic information extraction. In this talk, I will discuss three projects relevant to our efforts in literature curation:
1. The Pharmspresso project is a simple rule-based system for extracting mentions of gene, drug, disease and polymorphism interactions from text. It is based on the Textpresso system developed at Caltech, but adds specific rules about human drugs, genes and phenotypes. The initial version of Pharmspresso had good performance but suffered from false positive extractions, so we have been working to improve the performance while maintaining as much generality as possible. Pharmspresso is available at http://pharmspresso.stanford.edu/
2. The PGxPipeline project builds on the gene-drug-disease associations mined both manually and automatically to do scientific discovery. A critical bottleneck in pharmacogenetics is identifying genes that are likely to be important for modifying drug response. Unless the full details of drug action and metabolism are understood, any of the ~25,000 human genes could be important for understanding action and metabolism. PGxPipeline is built to accept as input a drug and an indication for use (e.g. pain or high cholesterol). It then uses both information from the literature and information about chemical structure to rank-order all genes in the human genome with respect to the likelihood that they interact with the drug of interest. In this way, we can prioritize the genes that are most likely to be relevant to the drug. We have found that our rank-order lists are useful adjuncts to other independent sources of information, and work best in combination with these.
3. Finally, we have been studying the sites in proteins that bind small molecules (such as drugs) or are important as active sites where the proteins' functions occur. We have clustered these sites based on structural similarity to discover new structural motifs associated with protein function. Very often, we have no knowledge of the function of these newly discovered structural motifs, but the literature often has substantial information about the function of the proteins to which these motifs belong. Our final project, then, is focused on gathering the literature associated with proteins that have a common motif, and determining what words/concepts are likely to describe the common functions of these proteins and therefore the likely significance of these shared structural motifs.
{"title":"Text mining for pharmacogenomics","authors":"R. Altman","doi":"10.1145/1458449.1458451","DOIUrl":"https://doi.org/10.1145/1458449.1458451","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2008-10-30","publicationTypes":"Journal Article"}
Metastasis is the most dangerous step in cancer progression and causes more than 90% of cancer deaths. Although many researchers have been working on the biological features and characteristics of metastasis, most of its genetic-level processes remain uncertain. Some studies have succeeded in elucidating metastasis-related genes and pathways and then predicting the prognosis of cancer patients, but it remains unclear whether the resulting genes or pathways contain enough information and whether noise features have been controlled appropriately. To address these problems, we conducted comparisons between primary tumors and secondary metastatic tumors. Noise arising from differences in tissue-specific characteristics between the two types of tumors was controlled by additional analyses. In this paper, we suggest a new method for identifying genes and pathways that are reliably metastasis-dependent and free of metastasis-independent features.
{"title":"Mining metastasis related genes by primary-secondary tumor comparisons from large-scale database","authors":"Sangwoo Kim, Doheon Lee","doi":"10.1145/1458449.1458458","DOIUrl":"https://doi.org/10.1145/1458449.1458458","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2008-10-30","publicationTypes":"Journal Article"}
Timur Fayruzov, M. D. Cock, C. Cornelis, Veronique Hoste
Most approaches for protein interaction mining from biomedical texts use both lexical and syntactic features. However, the individual impact of these two kinds of features on the effectiveness of the mining process has not yet been thoroughly studied. In this paper, we perform such a study on a recently published state-of-the-art support vector machine approach that uses both lexical and syntactic features. To this end, we strip this approach down to an algorithm that uses only a subset of the initial syntactic features. Next, we compare the original and the stripped-down methods by evaluating them on 5 benchmark datasets as well as by performing 5 additional cross-dataset experiments. Although the original method exploits a very rich feature set including words, parts of speech and grammatical relations, it is not significantly better than the stripped-down version; in fact, the former does not even consistently outperform the latter.
{"title":"The role of syntactic features in protein interaction extraction","authors":"Timur Fayruzov, M. D. Cock, C. Cornelis, Veronique Hoste","doi":"10.1145/1458449.1458463","DOIUrl":"https://doi.org/10.1145/1458449.1458463","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2008-10-30","publicationTypes":"Journal Article"}