S. Song, G. Heo, Ha Jin Kim, H. Jung, Yonghwan Kim, Min Song
Relation extraction is an important task in biomedical areas such as protein-protein interactions, gene-disease interactions, and drug-disease interactions. In recent years, the automatic extraction of biomedical relations from the vast amount of biomedical text data has been widely researched. In this paper, we propose a hybrid approach to extracting relations that builds on a feature set derived by a rule-based approach. We then use different classification algorithms, such as SVM, Naïve Bayes, and Decision Tree classifiers, for relation classification. The rationale for adopting shallow parsing and other NLP techniques to extract relations is twofold: simplicity and robustness. We select seven features with the rule-based shallow parsing technique and evaluate the performance on four public PPI corpora. Our experimental results show stable F-measure performance even with relatively few features.
{"title":"Grounded Feature Selection for Biomedical Relation Extraction by the Combinative Approach","authors":"S. Song, G. Heo, Ha Jin Kim, H. Jung, Yonghwan Kim, Min Song","doi":"10.1145/2665970.2665975","DOIUrl":"https://doi.org/10.1145/2665970.2665975","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":"35 1","pages":"0"},"publicationDate":"2014-11-07","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"134620168"}
Suhyun Ha, Sunyong Yoo, Moonshik Shin, J. Kwak, O. Kwon, M. Choi, K. Kang, Hojung Nam, Doheon Lee
Natural products used in dietary supplements, complementary and alternative medicine (CAM), and conventional medicine are composites of multiple chemical compounds. These chemical compounds potentially offer an extensive source for drug discovery, with accumulated knowledge of efficacy and safety. However, existing natural-product-related databases have drawbacks in both the standardization and the structuralization of information. Therefore, in this work, we construct an integrated database of natural products by mapping prescription, herb, compound, and phenotype information to international identifiers and by structuralizing the efficacy information through database integration and text-mining methods. We expect that the constructed database could serve as a fundamental resource for natural products research.
{"title":"Integrative Database for Exploring Compound Combinations of Natural Products for Medical Effects","authors":"Suhyun Ha, Sunyong Yoo, Moonshik Shin, J. Kwak, O. Kwon, M. Choi, K. Kang, Hojung Nam, Doheon Lee","doi":"10.1145/2665970.2665986","DOIUrl":"https://doi.org/10.1145/2665970.2665986","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":"182 1","pages":"0"},"publicationDate":"2014-11-07","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"115450589"}
Kwangmin Kim, Sejoon Lee, Kyunghyun Park, Dongjin Jang, Doheon Lee
A computer-based virtual human model is believed to be a promising solution for drug response identification. Literature mining is a competitive method for extracting the biological rules needed for human model simulation, since existing public databases provide only a limited amount of information applicable to the simulation. Here we propose a method for mining context-specific rules from the literature, for future application to virtual human model simulation. By integrating existing biological databases, we constructed a formalized ontology. In the PubMed literature, we tagged 11 distinct types of biological entities using both conditional random field (CRF) and dictionary-based Named Entity Recognition (NER). Recognized named entities were normalized and mapped to the formalized ontology. Context-specific biological rules between named entities, characterized by increase/decrease features, were extracted by a pattern-based method using regular expressions. As a result, we obtained organ-context-specific biological rules. Further research on enhanced rule and context extraction will follow.
{"title":"Mining Context-Specific Rules from the Literature for Virtual Human Model Simulation","authors":"Kwangmin Kim, Sejoon Lee, Kyunghyun Park, Dongjin Jang, Doheon Lee","doi":"10.1145/2665970.2665987","DOIUrl":"https://doi.org/10.1145/2665970.2665987","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":"13 1","pages":"0"},"publicationDate":"2014-11-07","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"129247469"}
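The pattern-based extraction step described in the abstract above can be illustrated with a minimal Python sketch. The pattern, the example sentences, and the function name `extract_rules` are assumptions made for illustration; the abstract does not give the authors' actual patterns, and here `\w+` merely stands in for entities that the paper identifies with CRF and dictionary-based NER.

```python
import re

# Hypothetical pattern: "<entity> increases|decreases <entity> [in (the) <organ>]".
# In the described system, entities come from NER; \w+ is only a stand-in.
RULE_PATTERN = re.compile(
    r"(?P<source>\w+) (?P<relation>increases|decreases) (?P<target>\w+)"
    r"(?: in (?:the )?(?P<organ>\w+))?"
)


def extract_rules(sentence):
    """Return (source, relation, target, organ) tuples found in a sentence."""
    return [
        (m.group("source"), m.group("relation"), m.group("target"), m.group("organ"))
        for m in RULE_PATTERN.finditer(sentence)
    ]


rules = extract_rules("Insulin decreases glucose in the liver.")
print(rules)  # [('Insulin', 'decreases', 'glucose', 'liver')]
```

When no organ phrase follows the relation, the organ slot is `None`, which leaves the rule without a context term.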
Biomedical literature from PubMed contains various types of entities, such as diseases or organisms. The rapid growth of this literature makes it harder to conceptualize; however, displaying the natural terms that occur in the text is more effective for understanding the target corpus to be searched than suggesting a concept related to a user query. Thus, we consider the natural, common words that biomedical information users actually write and speak. We extract bio-related terms from the corpus by mapping them to the UMLS, and we show entity-based networks with natural language terms as they appear in the text. In this paper, we present simple and precise associative networks of natural terms in the biomedical literature. The entity-based networks and entity relations can make understanding a biomedical literature corpus easier and more effective by detecting related terms and their hidden relations in the documents. We considered bio-entities and their relations in the biomedical literature and focused on a graphic display that can improve users' perception of a large corpus. To this end, we chose epidemiology as an experimental domain, extracted entities from the corpus by mapping them to the UMLS, and drew their relations as inferred from the Semantic Network of the UMLS. We then calculated term frequencies, co-occurrences, and term-pair similarities (see Figure 1). The results are distinct networks that display conceptual structures in the biomedical literature using natural language terms rather than concepts (see Figure 2). The networks we present make the biomedical collection easier to comprehend.
{"title":"A Display of Conceptual Structures in the Epidemiologic Literature","authors":"E. H. Kim, S. Song, Yonghwan Kim, Min Song","doi":"10.1145/2665970.2665983","DOIUrl":"https://doi.org/10.1145/2665970.2665983","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":"16 1","pages":"0"},"publicationDate":"2014-11-07","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"132066107"}
Tumor stratification is one of the basic tasks in cancer genomics for a better understanding of the tumor heterogeneity and better targeted treatments. There are various biological data that can be used to stratify tumors including gene expression and sequencing data. In this work, we use the somatic mutation data. Two types of somatic mutation profiles are generated and clustered using k-means clustering with appropriate distance measures to obtain cancer subtypes for each cancer type: binary somatic mutation profile and weighted somatic mutation profile. According to the predictive power of clinical features and survival time of the identified subtypes, the binary somatic mutation profile with Jaccard distance (B-Jac) performed the best and the weighted somatic mutation profile with Euclidean distance (W-Euc) performed comparably. Both approaches performed significantly better than the typical usage of somatic mutation, i.e. the binary somatic mutation profile with Euclidean distance (B-Euc).
{"title":"Identifying Cancer Subtypes based on Somatic Mutation Profile","authors":"Sungchul Kim, Lee Sael, Hwanjo Yu","doi":"10.1145/2665970.2665980","DOIUrl":"https://doi.org/10.1145/2665970.2665980","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":"33 1","pages":"0"},"publicationDate":"2014-11-07","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"133384810"}
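To make the difference between the distance measures named in the abstract above concrete, the following Python sketch compares Jaccard and Euclidean distance on toy binary mutation profiles. It covers only the distance measures, not the authors' clustering procedure, and the profiles and function names are invented for illustration.

```python
import math


def jaccard_distance(a, b):
    """1 - |shared mutations| / |mutations in either profile|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def euclidean_distance(a, b):
    """Ordinary Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy binary profiles over six genes (1 = mutated); all values are invented.
sparse_a = [1, 0, 0, 0, 0, 0]
sparse_b = [0, 1, 0, 0, 0, 0]  # shares no mutation with sparse_a
dense_a = [1, 1, 1, 1, 0, 0]
dense_b = [1, 1, 1, 0, 1, 0]   # shares three mutations with dense_a

# Both pairs differ in exactly two genes, so Euclidean distance cannot
# tell them apart, while Jaccard distance reflects the shared mutations.
print(round(euclidean_distance(sparse_a, sparse_b), 3),
      round(euclidean_distance(dense_a, dense_b), 3))
print(round(jaccard_distance(sparse_a, sparse_b), 3),
      round(jaccard_distance(dense_a, dense_b), 3))
```

Because Jaccard distance ignores genes that are unmutated in both tumors, it is less dominated by the many shared zeros of a sparse binary profile, which is one plausible reading of why the abstract pairs the binary profile with it.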
Seyeol Yoon, J. Jung, Hasun Yu, Mijin Kwon, Sungji Choo, Kyunghyun Park, Dongjin Jang, Sangwoo Kim, Doheon Lee
Interactions between biological entities such as genes, proteins, and metabolites, so-called pathways, are key to understanding the molecular mechanisms of life. As pathway information accumulates rapidly across various knowledge resources, there is growing interest in maintaining the integrity of these heterogeneous databases. Here, we define a conflict as a state in which two contradictory pieces of evidence (e.g., 'A increases B' and 'A decreases B') coexist in the same pathway. Such conflicts damage the unity of the integrated pathway network, so inference or simulation on it might be unreliable. We define a rule and a rule group. A rule consists of an interaction between two entities, a meta-relation (increase or decrease), and context terms describing tissue specificity or environmental conditions. Rules that have the same interaction are grouped into a rule group. If the rules in a group do not have a unanimous meta-relation, the rule group and its rules are judged to be conflicting. This analysis revealed that almost 20% of known interactions suffer from conflicting information and that conflicting information occurs much more frequently in the literature than in public databases. Because entities may have dual functions depending on context, we expected that considering context might resolve conflicts, so we grouped rules that share the same context terms as well as the same interaction. This revealed that up to 86% of the conflicts could be resolved by considering context. Subsequent analysis also showed that contradictory records are generally supported by closely matched levels of evidence, but some information might be suspicious when the evidence levels are seriously imbalanced. By identifying and resolving the conflicts, we expect that pathway databases can be cleaned and used for better secondary analyses such as gene/protein annotation, network dynamics, and qualitative/quantitative simulation.
{"title":"Systematic Identification of Context-dependent Conflicting Information in Biological Pathways","authors":"Seyeol Yoon, J. Jung, Hasun Yu, Mijin Kwon, Sungji Choo, Kyunghyun Park, Dongjin Jang, Sangwoo Kim, Doheon Lee","doi":"10.1145/2665970.2665973","DOIUrl":"https://doi.org/10.1145/2665970.2665973","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":"2017 1","pages":"0"},"publicationDate":"2014-11-07","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"130130017"}
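The rule-group definition in the abstract above lends itself to a short Python sketch. The rule tuples, entity names, and the function `conflicting_groups` are hypothetical; the sketch only mirrors the stated definitions (group by interaction, flag non-unanimous meta-relations, then regroup with context terms) and is not the authors' implementation.

```python
from collections import defaultdict

# Each rule: (source, target, meta_relation, context). All values are invented.
RULES = [
    ("A", "B", "increase", "liver"),
    ("A", "B", "decrease", "muscle"),
    ("C", "D", "increase", "liver"),
    ("C", "D", "decrease", "liver"),
    ("E", "F", "increase", "brain"),
]


def conflicting_groups(rules, use_context=False):
    """Return keys of rule groups whose meta-relations are not unanimous."""
    groups = defaultdict(set)
    for source, target, relation, context in rules:
        # Group by interaction alone, or by interaction plus context term.
        key = (source, target, context) if use_context else (source, target)
        groups[key].add(relation)
    return sorted(key for key, relations in groups.items() if len(relations) > 1)


without_context = conflicting_groups(RULES)
with_context = conflicting_groups(RULES, use_context=True)
print(without_context)  # [('A', 'B'), ('C', 'D')]
print(with_context)     # [('C', 'D', 'liver')]
```

In this toy data the A-B conflict disappears once context is considered, because the two contradictory rules apply to different tissues, while the C-D conflict remains because both rules share the same context.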
Descriptions of patients' injuries are recorded in narrative text form by hospital emergency departments. For statistical reporting, this text data needs to be mapped to pre-defined codes. Existing research in this field uses the Naïve Bayes probabilistic method to build classifiers for the mapping. In this paper, we focus on providing guidance on the selection of a classification method. We build a number of classifiers belonging to different classification families, such as decision tree, probabilistic, neural network, instance-based, ensemble-based, and kernel-based linear classifiers. Extensive pre-processing is carried out to ensure the quality of the data and, hence, the quality of the classification outcome. Records with a null entry in the injury description are removed. Misspellings are corrected by finding each misspelt word and replacing it with a similar-sounding word. Meaningful phrases are identified and kept, instead of removing part of a phrase as a stop word. Abbreviations appearing in many forms are manually identified, and only one form of each abbreviation is used. Clustering is utilised to discriminate between non-frequent and frequent terms. This process reduced the number of text features dramatically, from about 28,000 to 5,000. The medical narrative text injury dataset under consideration is composed of many short documents. The data can be characterized as high-dimensional and sparse, i.e., few features are irrelevant but features are correlated with one another. Therefore, matrix factorization techniques such as Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NNMF) are used to map the processed feature space to a lower-dimensional feature space, and classifiers are built on this reduced feature space. In the experiments, a set of tests is conducted to determine which classification method is best for medical text classification. Non-Negative Matrix Factorization combined with a Support Vector Machine achieves 93% precision, which is higher than all the tested traditional classifiers. We also found that TF/IDF weighting, which works well for long text classification, is inferior to binary weighting in short document classification. Another finding is that the top-n terms should be removed in consultation with medical experts, as their removal affects classification performance.
{"title":"Injury Narrative Text Classification: A Preliminary Study","authors":"Lin Chen, K. Vallmuur, R. Nayak","doi":"10.1145/2665970.2665976","DOIUrl":"https://doi.org/10.1145/2665970.2665976","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":"96 1","pages":"0"},"publicationDate":"2014-11-07","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"134370820"}
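The two term-weighting schemes compared in the abstract above can be sketched in a few lines of Python. The toy narratives, the function names, and the particular TF-IDF variant (raw term frequency times the natural log of N/df) are assumptions; the abstract does not state which variant was used.

```python
import math
from collections import Counter

# Toy tokenised injury narratives; the texts are invented for illustration.
DOCS = [
    ["fell", "off", "ladder"],
    ["fell", "down", "stairs"],
    ["dog", "bite", "hand"],
]


def binary_weights(doc):
    """Weight 1.0 for every term that occurs in the document."""
    return {term: 1.0 for term in set(doc)}


def tfidf_weights(doc, docs):
    """Raw term frequency times log(N / document frequency)."""
    weights = {}
    for term, tf in Counter(doc).items():
        df = sum(1 for d in docs if term in d)
        weights[term] = tf * math.log(len(docs) / df)
    return weights


binary = binary_weights(DOCS[0])
tfidf = tfidf_weights(DOCS[0], DOCS)
print(binary)
print(tfidf)
```

In short documents nearly every term occurs once, so the term-frequency factor carries little signal and TF-IDF mostly rescales terms by rarity; this may be part of why binary weighting holds up in that setting, though the abstract itself reports only the empirical finding.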
J. Jung, Hasun Yu, Seyeol Yoon, Mijin Kwon, Sungji Choo, Sangwoo Kim, Doheon Lee
Inferring drug-induced phenotypes via computational approaches can substantially support the drug discovery procedure. However, existing computational models, which are mainly based on a single-cell or single-organ model, are thought to be limited because phenotypes are consequences of stochastic biochemical processes among distant cells and organs as well as among molecules confined to one cell. Therefore, there is an urgent demand for a new computational model that represents heterogeneous biochemical interactions spanning the entire human body. To meet the demand, we constructed multi-level networks that incorporate previously uncovered high-level properties such as molecules, cells, organs, and phenotypes. Currently, the networks consist of 1,776,506 edges, including molecular networks within 76 pre-defined cell types, inter-cell interactions among the cell types, and gene (protein) relations to 429 phenotypes. We are also planning to verify whether known drug-induced phenotypes are reproducible in the networks using a Petri-net-based simulation.
{"title":"Construction of Multi-level Networks Incorporating Molecule, Cell, Organ and Phenotype Properties for Drug-induced Phenotype Prediction","authors":"J. Jung, Hasun Yu, Seyeol Yoon, Mijin Kwon, Sungji Choo, Sangwoo Kim, Doheon Lee","doi":"10.1145/2665970.2665989","DOIUrl":"https://doi.org/10.1145/2665970.2665989","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":"105 1","pages":"0"},"publicationDate":"2014-11-07","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"114596708"}
Y. Jeong, Dahee Lee, Namgi Han, Won Chul Kim, Min Song
Biomedical information extraction, especially Named Entity Recognition (NER), is a primary task in biomedical text mining due to the rapid growth of large-scale literature. Biomedical entity extraction aims to identify specific entities (words or phrases) in unstructured text data. In this work, we introduce a novel biomedical NER system that uses a combination of regional and global text features: linguistic, lexical, contextual, and syntactic features. Our system adopts Conditional Random Fields (CRFs) [1] as its machine learning algorithm and consists of two major pipelines (see Figure 1). We focus especially on constructing the first pipeline for text processing in a modularized manner and on discovering rich feature sets covering comprehensive linguistics and contexts. To implement the CRF framework in the second pipeline, our system uses a modified version of Mallet [2] to take advantage of feature induction. In 10-fold cross-validation on the GENETAG corpus [3], our system improves F-measure by 0.99% to 18.47% and achieves the highest precision compared to existing open-source biomedical NER systems. We find that several components, such as abundant key features, external resources, and feature induction, contribute to the performance of the proposed system.
{"title":"Biomedical Named Entity Recognition Based on the Combination of Regional and Global Text Features","authors":"Y. Jeong, Dahee Lee, Namgi Han, Won Chul Kim, Min Song","doi":"10.1145/2665970.2665990","DOIUrl":"https://doi.org/10.1145/2665970.2665990","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":"1 1","pages":"0"},"publicationDate":"2014-11-07","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"128755255"}
Yonghui Wu, J. Denny, S. Rosenbloom, R. Miller, D. Giuse, Min Song, Hua Xu
To save time, healthcare providers frequently use abbreviations while authoring clinical documents. Nevertheless, abbreviations that authors deem unambiguous often confuse other readers, including clinicians, patients, and natural language processing (NLP) systems. Most current clinical NLP systems "post-process" notes long after clinicians enter them into electronic health record systems (EHRs). Such post-processing cannot guarantee 100% accuracy in abbreviation identification and disambiguation, since multiple alternative interpretations exist. In this paper, the authors describe a prototype system for real-time Clinical Abbreviation Recognition and Disambiguation (CARD) -- i.e., a system that interacts with authors during note generation to verify correct abbreviation senses. The CARD system design anticipates future integration with web-based clinical documentation systems to improve the quality of healthcare records. The prototype application embodies three word sense disambiguation (WSD) methods. We evaluated the accuracy and response times of the prototype CARD system in a simulated study. Using an existing test data set of 25 commonly observed, highly ambiguous clinical abbreviations, the evaluation demonstrated that the best WSD method had an accuracy of 88.8% and a reasonable average response time of 1.6 milliseconds per abbreviation. The study indicates the potential feasibility of real-time NLP-enabled abbreviation disambiguation within clinical documentation systems.
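To make the idea of real-time disambiguation with a measured response time concrete, here is a minimal sketch. The abstract does not say which three WSD methods CARD uses, so this substitutes a simple context-overlap heuristic purely for illustration. The sense inventory `SENSES`, its context words, and the function names are all hypothetical and are not taken from the CARD system or its test data.

```python
import time

# Hypothetical sense inventory: abbreviation -> {sense: typical context words}.
# The entries are invented for illustration only.
SENSES = {
    "pt": {
        "patient": {"denies", "reports", "admitted", "complains"},
        "physical therapy": {"referral", "session", "exercise", "rehab"},
    },
}


def disambiguate(abbrev, context_tokens, senses=SENSES):
    """Return the sense whose context profile overlaps most with the note text.

    Returns None for unknown abbreviations. Ties go to the sense listed first.
    """
    candidates = senses.get(abbrev.lower())
    if not candidates:
        return None
    context = {t.lower() for t in context_tokens}
    return max(candidates, key=lambda s: len(candidates[s] & context))


def timed_disambiguate(abbrev, context_tokens):
    """Disambiguate and also report the elapsed wall-clock time in milliseconds."""
    start = time.perf_counter()
    sense = disambiguate(abbrev, context_tokens)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return sense, elapsed_ms
```

In an interactive setting, the returned sense would be shown to the note author for confirmation rather than applied silently, which matches the verification step the abstract describes. Timings from this toy function say nothing about the 1.6 ms figure reported for the prototype.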
{"title":"A prototype application for real-time recognition and disambiguation of clinical abbreviations","authors":"Yonghui Wu, J. Denny, S. Rosenbloom, R. Miller, D. Giuse, Min Song, Hua Xu","doi":"10.1145/2512089.2512096","DOIUrl":"https://doi.org/10.1145/2512089.2512096","journal":{"name":"Data and Text Mining in Bioinformatics","volume":"1 1","pages":"0"},"publicationDate":"2013-11-01","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"130790124"}