Vector space methods that measure semantic similarity and relatedness often rely on distributional information such as co-occurrence frequencies or statistical measures of association to weight the importance of particular co-occurrences. In this paper, we extend these methods by incorporating a measure of semantic similarity based on a human-curated taxonomy into a second-order vector representation. This results in a measure of semantic relatedness that combines the contextual information available in a corpus-based vector space representation with the semantic knowledge found in a biomedical ontology. Our results show that incorporating semantic similarity into second-order co-occurrence matrices improves correlation with human judgments for both similarity and relatedness, and that our method compares favorably to various word embedding methods that have recently been evaluated on the same reference standards we have used.
{"title":"Improving Correlation with Human Judgments by Integrating Semantic Similarity with Second-Order Vectors","authors":"Bridget T. McInnes, Ted Pedersen","doi":"10.18653/v1/W17-2313","DOIUrl":"https://doi.org/10.18653/v1/W17-2313","journal":{"name":"Workshop on Biomedical Natural Language Processing"},"publicationDate":"2016-09-02"}
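The second-order vector idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the sentence-level co-occurrence window, and the `weight` callback (a stand-in for a taxonomy-based similarity score in [0, 1]) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def first_order(sentences):
    """Count how often each pair of distinct words shares a sentence."""
    cooc = defaultdict(Counter)
    for sent in sentences:
        for a, b in combinations(sorted(set(sent)), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def second_order(term, cooc, weight=None):
    """Sum the first-order vectors of the words that co-occur with `term`.

    `weight(term, neighbour)` is a hypothetical similarity function; when it
    is omitted, every neighbour contributes with weight 1.0.
    """
    vec = Counter()
    for neighbour in list(cooc[term]):
        w = 1.0 if weight is None else weight(term, neighbour)
        for word, count in cooc[neighbour].items():
            vec[word] += w * count
    return vec


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Toy corpus, invented for this example.
corpus = [
    ["aspirin", "treats", "headache"],
    ["ibuprofen", "treats", "headache"],
    ["aspirin", "thins", "blood"],
]
cooc = first_order(corpus)
relatedness = cosine(second_order("aspirin", cooc), second_order("ibuprofen", cooc))
```

Passing a `weight` function scales each neighbour's contribution by its similarity to the target term, which is one simple way to let ontology knowledge reshape a corpus-derived vector.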
Louise Deléger, Robert Bossy, Estelle Chaix, Mouhamadou Ba, Arnaud Ferré, P. Bessières, C. Nédellec
This paper presents the Bacteria Biotope task of the BioNLP Shared Task 2016, which follows the previous 2013 and 2011 editions. The task focuses on the extraction of the locations (biotopes and geographical places) of bacteria from PubMed abstracts and the characterization of bacteria and their associated habitats with respect to reference knowledge sources (NCBI taxonomy, OntoBiotope ontology). The task is motivated by the importance of knowledge about bacteria habitats for fundamental research and applications in microbiology. The paper describes the proposed subtasks, the corpus characteristics, the challenge organization, and the evaluation metrics. We also provide an analysis of the results obtained by participants.
{"title":"Overview of the Bacteria Biotope Task at BioNLP Shared Task 2016","authors":"Louise Deléger, Robert Bossy, Estelle Chaix, Mouhamadou Ba, Arnaud Ferré, P. Bessières, C. Nédellec","doi":"10.18653/v1/W16-3002","DOIUrl":"https://doi.org/10.18653/v1/W16-3002","journal":{"name":"Workshop on Biomedical Natural Language Processing"},"publicationDate":"2016-08-13"}
Estelle Chaix, B. Dubreucq, Abdelhak Fatihi, Dialekti Valsamou, Robert Bossy, Mouhamadou Ba, Louise Deléger, Pierre Zweigenbaum, P. Bessières, L. Lepiniec, C. Nédellec
This paper presents the SeeDev Task of the BioNLP Shared Task 2016. The purpose of the SeeDev Task is to extract, from scientific articles, descriptions of the genetic and molecular mechanisms involved in seed development of the model plant Arabidopsis thaliana. The task consists of extracting many different event types that involve a wide range of entity types, so that they accurately reflect the complexity of the biological mechanisms. The corpus is composed of paragraphs selected from the full texts of relevant scientific articles. In this paper, we describe the organization of the SeeDev task, the corpus characteristics, and the metrics used to evaluate participant systems. We analyze and discuss the final results of the seven participating systems on the test set. The best F-score is 0.432, which is comparable to the scores achieved in similar tasks on molecular biology.
{"title":"Overview of the Regulatory Network of Plant Seed Development (SeeDev) Task at the BioNLP Shared Task 2016.","authors":"Estelle Chaix, B. Dubreucq, Abdelhak Fatihi, Dialekti Valsamou, Robert Bossy, Mouhamadou Ba, Louise Deléger, Pierre Zweigenbaum, P. Bessières, L. Lepiniec, C. Nédellec","doi":"10.18653/v1/W16-3001","DOIUrl":"https://doi.org/10.18653/v1/W16-3001","journal":{"name":"Workshop on Biomedical Natural Language Processing"},"publicationDate":"2016-08-01"}
This paper presents our participation in the Bacteria/Biotope track of the 2016 BioNLP Shared Task. Our methods rely on a combination of distinct machine-learning and rule-based systems. We used CRF and post-processing rules to identify mentions of bacteria and biotopes, a rule-based approach to normalize the concepts in the ontology and the taxonomy, and SVM to identify relations between bacteria and biotopes. On the test datasets, we achieved results similar to those obtained on the development datasets: on the categorization task, precision of 0.503 (gold standard entities) and SER of 0.827 (both NER and categorization); on the event relation task, F-measure of 0.485 (gold standard entities, ranking third out of 11) and of 0.192 (both NER and event relation, ranking first); on the knowledge-based task, mean references of 0.771 (gold standard entities) and of 0.202 (NER, categorization and event relation combined).
{"title":"Identification of Mentions and Relations between Bacteria and Biotope from PubMed Abstracts","authors":"Cyril Grouin","doi":"10.18653/v1/W16-3008","DOIUrl":"https://doi.org/10.18653/v1/W16-3008","journal":{"name":"Workshop on Biomedical Natural Language Processing"},"publicationDate":"2016-08-01"}
Mert Tiftikci, H. Sahin, Berfu Büyüköz, Alper Yayikçi, Arzucan Özgür
A database which provides information about bacteria and their habitats in a comprehensive and normalized way is crucial for applied microbiology studies. Having this information spread through textual resources such as scientific articles and web pages leads to a need for automatically detecting bacteria and habitat entities in text, semantically tagging them using ontologies, and finally extracting the events among them. These are the challenges set forth by the Bacteria Biotopes Task of the BioNLP Shared Task 2016. This paper describes a system for habitat and bacteria entity normalization through the OntoBiotope ontology and the NCBI taxonomy, respectively. The system, which obtained promising results on the shared task data set, utilizes basic information retrieval techniques.
{"title":"Ontology-Based Categorization of Bacteria and Habitat Entities using Information Retrieval Techniques","authors":"Mert Tiftikci, H. Sahin, Berfu Büyüköz, Alper Yayikçi, Arzucan Özgür","doi":"10.18653/v1/w16-3007","DOIUrl":"https://doi.org/10.18653/v1/w16-3007","journal":{"name":"Workshop on Biomedical Natural Language Processing"},"publicationDate":"2016-08-01"}
Honglei Li, Jianhai Zhang, Jian Wang, Hongfei Lin, Zhihao Yang
We participated in two event extraction tasks of the BioNLP 2016 Shared Task: binary relation extraction in the SeeDev task and localization relation extraction in the Bacteria Biotope task. A convolutional neural network (CNN) models the sentences through convolution and max-pooling operations over raw input represented with word embeddings. A fully connected neural network then learns high-level, significant features automatically. The proposed model contains two main modules: building the distributed semantic representation (word, POS, distance and entity type embeddings) and training the CNN model. F-scores of 0.370 and 0.478 on the test sets of the tasks in which we participated show that the proposed method is effective for binary relation extraction and, through automatic feature learning, reduces the need for manual feature engineering.
{"title":"DUTIR in BioNLP-ST 2016: Utilizing Convolutional Network and Distributed Representation to Extract Complicate Relations","authors":"Honglei Li, Jianhai Zhang, Jian Wang, Hongfei Lin, Zhihao Yang","doi":"10.18653/v1/W16-3012","DOIUrl":"https://doi.org/10.18653/v1/W16-3012","journal":{"name":"Workshop on Biomedical Natural Language Processing"},"publicationDate":"2016-08-01"}
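The convolution and max-pooling step mentioned in the abstract above can be sketched in plain Python. This is only an illustration of the general technique, not the DUTIR system: the two-dimensional toy embeddings, the hand-written kernels, and the function names are assumptions, and a real model would learn its kernels and concatenate several embedding types.

```python
def conv1d(embeddings, kernel):
    """Slide a kernel spanning len(kernel) tokens over the sentence.

    Each output value is the dot product of the kernel with one window
    of consecutive token embeddings.
    """
    width = len(kernel)
    outputs = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        outputs.append(
            sum(w * x
                for kernel_row, token_row in zip(kernel, window)
                for w, x in zip(kernel_row, token_row))
        )
    return outputs


def max_pool(values):
    """Max-over-time pooling: keep the strongest response of one kernel."""
    return max(values)


def sentence_features(embeddings, kernels):
    """One pooled feature per kernel, giving a fixed-length sentence vector."""
    return [max_pool(conv1d(embeddings, kernel)) for kernel in kernels]


# Three tokens with 2-dimensional embeddings and two width-2 kernels (all invented).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
features = sentence_features(tokens, kernels)
```

Because pooling keeps one value per kernel, sentences of any length (at least as long as the kernel width) map to a vector of the same size, which is what allows a fully connected layer to follow.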
The number of scientific papers published each year is growing exponentially, and given the rate of this growth, automated information extraction is needed to efficiently extract information from this corpus. A critical first step in this process is to accurately recognize the names of entities in text. Previous efforts, such as SPECIES, have identified bacteria strain names, among other taxonomic groups, but have been limited to those names present in the NCBI taxonomy. We have implemented a dictionary-based named entity tagger, TagIt, that is followed by a rule-based expansion system to identify bacteria strain names and habitats and resolve them to the closest possible match in the NCBI taxonomy and the OntoBiotope ontology, respectively. The rule-based post-processing steps expand acronyms and extend strain names according to a set of rules, which captures additional aliases and strains that are not present in the dictionary. TagIt has the best performance out of three entries to BioNLP-ST BB3 cat+ner, with an overall SER of 0.628 on the independent test set.
{"title":"A dictionary- and rule-based system for identification of bacteria and habitats in text","authors":"H. Cook, E. Pafilis, L. Jensen","doi":"10.18653/v1/W16-3006","DOIUrl":"https://doi.org/10.18653/v1/W16-3006","journal":{"name":"Workshop on Biomedical Natural Language Processing"},"publicationDate":"2016-08-01"}
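A dictionary lookup followed by a rule that adds name variants, as described in the abstract above, might look like the following sketch. It does not reproduce TagIt: the genus-abbreviation rule, the function names, and the identifiers `ID-1` and `ID-2` are placeholders chosen for illustration.

```python
import re


def expand_with_abbreviations(dictionary):
    """Add 'E. coli'-style variants for every multi-word name.

    This single rule stands in for a larger rule set; existing entries
    are never overwritten.
    """
    expanded = dict(dictionary)
    for name, identifier in dictionary.items():
        parts = name.split()
        if len(parts) >= 2:
            short = parts[0][0] + ". " + " ".join(parts[1:])
            expanded.setdefault(short, identifier)
    return expanded


def tag_entities(text, dictionary):
    """Return (start, end, surface, identifier) tuples sorted by position.

    Longer names are matched first so that a longer entry is preferred
    over a shorter one covering the same characters.
    """
    taken = [False] * len(text)
    spans = []
    for name in sorted(dictionary, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            if not any(taken[match.start():match.end()]):
                for i in range(match.start(), match.end()):
                    taken[i] = True
                spans.append((match.start(), match.end(), match.group(0), dictionary[name]))
    return sorted(spans)


# Placeholder identifiers, not real taxonomy or ontology IDs.
names = {"Escherichia coli": "ID-1", "Bacillus subtilis": "ID-2"}
mentions = tag_entities("E. coli and Bacillus subtilis were isolated.",
                        expand_with_abbreviations(names))
```

Generating variants before matching keeps the tagger itself a plain lookup, so new rules can be added without touching the matching code.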
Even a simple biological phenomenon may involve a complex network of molecular interactions. Scientific literature is one of the trusted sources of knowledge about these networks. We propose LitWay, a system for extracting semantic relations from texts. LitWay uses a hybrid method that combines a rule-based method with a machine learning-based method. Tested on the SeeDev task of BioNLP-ST 2016, it achieves state-of-the-art performance with an F-score of 43.2%, ranking first among all participating teams. To further reveal the linguistic characteristics of each event, we test the system with syntactic rules alone, with machine learning alone, and with different combinations of the two methods. We find that it is difficult for a single method to achieve good performance on all semantic relation types because of the complexity of bio-events in the literature.
{"title":"LitWay, Discriminative Extraction for Different Bio-Events","authors":"Chen Li, Zhiqiang Rao, Xiangrong Zhang","doi":"10.18653/v1/W16-3004","DOIUrl":"https://doi.org/10.18653/v1/W16-3004","journal":{"name":"Workshop on Biomedical Natural Language Processing"},"publicationDate":"2016-08-01"}
Through advanced technologies in clinical care and research, especially the rapid progress in imaging technologies, more and more medical imaging data and patient text data is generated by hospitals, pharmaceutical companies, and medical research. To enable advanced access to clinical imaging and text data, it is important to know what kind of knowledge clinicians seek and which queries they are interested in. Through intensive interviews and discussions with radiologists and clinicians, we have learned that medical imaging data is analyzed, and hence queried, from three different perspectives: the anatomic perspective addressing the involved body parts, the radiology-specific spatial perspective describing the relationships of located anatomical regions to other anatomical parts, and the disease perspective distinguishing between normal and abnormal imaging features. Our aim is to establish query patterns reflecting those three perspectives that would typically be used by clinicians and radiologists to find patient-specific sets of relevant images.
{"title":"Statistical Term Profiling for Query Pattern Mining","authors":"P. Buitelaar, P. Wennerberg, S. Zillner","doi":"10.3115/1572306.1572336","DOIUrl":"https://doi.org/10.3115/1572306.1572336","journal":{"name":"Workshop on Biomedical Natural Language Processing"},"publicationDate":"2008-06-19"}
This paper focuses on determining which proteins affect the activity of the Aryl Hydrocarbon Receptor (AHR) system by learning a model that can accurately predict its activity when single genes are knocked out. We present experiments and results for models trained on a single source of information: abstracts from Medline (http://medline.cos.com/) that discuss the genes involved in the experiments. The results suggest that an AdaBoost classifier with a binary bag-of-words representation obtains significantly better results.
{"title":"Textual Information for Predicting Functional Properties of the Genes","authors":"Oana Frunza, D. Inkpen","doi":"10.3115/1572306.1572334","DOIUrl":"https://doi.org/10.3115/1572306.1572334","journal":{"name":"Workshop on Biomedical Natural Language Processing"},"publicationDate":"2008-06-19"}