In this paper I observe a number of new plural and (apparently) quantified examples of free indirect discourse (FID) and protagonist projection (PP). I analyse them within major current theoretical approaches, proposing extensions to these approaches where needed. In order to derive the wide range of readings observed with plural protagonists, I show how we can exploit existing mechanisms for the interpretation of plural anaphora and plural predication. The upshot is that the interpretation of plural examples of perspective shift relies on a remarkable concert of covert semantic and pragmatic operations.
{"title":"Plural and Quantified Protagonists in Free Indirect Discourse and Protagonist Projection","authors":"Márta Abrusán","doi":"10.1093/jos/ffad004","DOIUrl":"https://doi.org/10.1093/jos/ffad004","url":null,"abstract":"\u0000 In this paper I observe a number of new plural and (apparently) quantified examples of free indirect discourse (FID) and protagonist projection (PP). I analyse them within major current theoretical approaches, proposing extensions to these approaches where needed. In order to derive the wide range of readings observed with plural protagonists, I show how we can exploit existing mechanisms for the interpretation of plural anaphora and plural predication. The upshot is that the interpretation of plural examples of perspective shift relies on a remarkable concert of covert semantic and pragmatic operations.","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"18 1","pages":"127-151"},"PeriodicalIF":1.9,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77176856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Indicative conditionals and configurations with neg-raising predicates have been brought up as potential candidates for constructions involving world pluralities. I argue against this hypothesis, showing that cumulativity and quantifiers targeting a plurality’s part structure cannot access the presumed world pluralities. I furthermore argue that this makes worlds special in the sense that the same tests provide evidence for pluralities in various other semantic domains.
{"title":"Are There Pluralities of Worlds?","authors":"V. Schmitt","doi":"10.1093/jos/ffad002","DOIUrl":"https://doi.org/10.1093/jos/ffad002","url":null,"abstract":"\u0000 Indicative conditionals and configurations with neg-raising predicates have been brought up as potential candidates for constructions involving world pluralities. I argue against this hypothesis, showing that cumulativity and quantifiers targeting a plurality’s part structure cannot access the presumed world pluralities. I furthermore argue that this makes worlds special in the sense that the same tests provide evidence for pluralities in various other semantic domains.","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"7 1","pages":"153-178"},"PeriodicalIF":1.9,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81520300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Copredication occurs when a sentence receives a true reading despite prima facie ascribing categorically incompatible properties to a single entity. For example, ‘The red book is by Tolstoy’ can have a true reading even though it seems that being red is only a property of physical copies, while being by Tolstoy is only a property of informational texts.

A tempting strategy for resolving this tension is to claim that at least one of the predicates has a non-standard interpretation, with the salient proposal involving reinterpretation via meaning transfer. For example, in ‘The red book is by Tolstoy’, one could hold that the predicate ‘by Tolstoy’ is reinterpreted (or, on the more specific proposal, transferred) to ascribe a property that physical copies can uncontroversially instantiate, such as ‘expresses an informational text by Tolstoy’. On this view, the truth of the copredicational sentence is no longer mysterious. Furthermore, such a reinterpretation view can give a straightforward account of a range of puzzling copredicational sentences involving counting and individuation.

Despite these substantial virtues, we will argue that reinterpretation approaches to copredication are untenable. In §1 we introduce reinterpretation views of copredication and contrast them with key alternatives. In §2 we argue against a general reinterpretation theory of copredication on which every copredicational sentence contains at least one reinterpreted predicate. We also raise additional problems for the more specific proposal of implementing reinterpretation via meaning transfer. In §3 we argue against more limited appeals to reinterpretation on which only some copredicational sentences contain reinterpretation. In §4 we criticize a series of arguments in favour of reinterpretation theories. The upshot is that reinterpretation theories of copredication, and in particular meaning transfer-based accounts, should be rejected.
{"title":"Copredication and Meaning Transfer","authors":"David Liebesman, Ofra Magidor","doi":"10.1093/jos/ffad001","DOIUrl":"https://doi.org/10.1093/jos/ffad001","url":null,"abstract":"\u0000 Copredication occurs when a sentence receives a true reading despite prima facie ascribing categorically incompatible properties to a single entity. For example, ‘The red book is by Tolstoy’ can have a true reading even though it seems that being red is only a property of physical copies, while being by Tolstoy is only a property of informational texts.\u0000 A tempting strategy for resolving this tension is to claim that at least one of the predicates has a non-standard interpretation, with the salient proposal involving reinterpretation via meaning transfer. For example, in ‘The red book is by Tolstoy’, one could hold that the predicate ‘by Tolstoy’ is reinterpreted (or on the more specific proposal, transferred) to ascribe a property that physical copies can uncontroversially instantiate, such as expresses an informational text by Tolstoy. On this view, the truth of the copredicational sentence is no longer mysterious. Furthermore, such a reinterpretation view can give a straightforward account of a range of puzzling copredicational sentences involving counting an individuation.\u0000 Despite these substantial virtues, we will argue that reinterpretation approaches to copredication are untenable. In §1 we introduce reinterpretation views of copredication and contrast them with key alternatives. In §2 we argue against a general reinterpretation theory of copredication on which every copredicational sentence contains at least one reinterpreted predicate. We also raise additional problems for the more specific proposal of implementing reinterpretation via meaning transfer. In §3 we argue against more limited appeals to reinterpretation on which only some copredicational sentences contain reinterpretation. 
In §4 we criticize a series of arguments in favour of reinterpretation theories. The upshot is that reinterpretation theories of copredication, and in particular, meaning transfer-based accounts, should be rejected.","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"3 1","pages":"69-91"},"PeriodicalIF":1.9,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81558248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-24. DOI: 10.1186/s13326-023-00283-x
Lauren E Chan, Anne E Thessen, William D Duncan, Nicolas Matentzoglu, Charles Schmitt, Cynthia J Grondin, Nicole Vasilevsky, Julie A McMurry, Peter N Robinson, Christopher J Mungall, Melissa A Haendel
Background: Evaluating the impact of environmental exposures on organism health is a key goal of modern biomedicine, and it is critically important in an age of increasing pollution and chemical exposure in our environment. Environmental health research uses many different methods and generates a variety of data types. To date, however, no comprehensive database represents the full spectrum of environmental health data, and because existing databases are not interoperable, tools for integrating these resources are needed. In this manuscript we present the Environmental Conditions, Treatments, and Exposures Ontology (ECTO), a species-agnostic ontology focused on exposure events that occur as a result of natural and experimental processes such as diet, work, or research activities. ECTO is intended for harmonizing environmental health data resources to support cross-study integration and inference for mechanism discovery.

Methods and findings: ECTO is an ontology designed for describing organismal exposures drawn from toxicological research, environmental variables, dietary features, and patient-reported survey data. ECTO builds on the base model established in the Exposure Ontology (ExO). It is developed using a combination of manual curation and Dead Simple OWL Design Patterns (DOSDP), contains over 2,700 environmental exposure terms, and incorporates chemical and environmental ontologies. ECTO is an Open Biological and Biomedical Ontology (OBO) Foundry ontology designed for interoperability, reuse, and axiomatization with other ontologies. ECTO terms have been used in axioms within the Mondo Disease Ontology to represent diseases caused or influenced by environmental factors, as well as for survey encoding in the Personalized Environment and Genes Study (PEGS).
Conclusions: We constructed ECTO to meet Open Biological and Biomedical Ontology (OBO) Foundry principles to increase translation opportunities between environmental health and other areas of biology. ECTO has a growing community of contributors consisting of toxicologists, public health epidemiologists, and health care providers to provide the necessary expertise for areas that have been identified previously as gaps.
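The DOSDP mechanism mentioned above is, at its core, template filling: a pattern file declares a term schema with variables, and each row of a filler table instantiates one ontology term. The sketch below is a deliberately simplified, hypothetical illustration of that idea; the pattern fields and the CHEBI identifiers are stand-ins, not actual ECTO pattern files.

```python
# Toy illustration of a Dead Simple OWL Design Pattern (DOSDP):
# a template with variables, filled in from a table of chemical terms.
# The pattern and identifiers below are hypothetical, for illustration only.

PATTERN = {
    "name": "exposure_to_chemical",
    "label": "exposure to {chemical}",
    "definition": "An exposure event involving {chemical}.",
    "equivalentTo": "'exposure event' and 'has input' some {chemical_id}",
}

FILLERS = [
    {"chemical": "lead", "chemical_id": "CHEBI:25016"},
    {"chemical": "benzene", "chemical_id": "CHEBI:16716"},
]

def fill(pattern, row):
    """Instantiate one ontology term from the pattern and a table row."""
    return {k: v.format(**row) for k, v in pattern.items() if k != "name"}

terms = [fill(PATTERN, row) for row in FILLERS]
for t in terms:
    print(t["label"], "|", t["equivalentTo"])
```

Real DOSDP tooling reads YAML pattern files and TSV filler tables and emits OWL axioms, but the substitution step is essentially the one shown here.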
{"title":"The Environmental Conditions, Treatments, and Exposures Ontology (ECTO): connecting toxicology and exposure to human health and beyond.","authors":"Lauren E Chan, Anne E Thessen, William D Duncan, Nicolas Matentzoglu, Charles Schmitt, Cynthia J Grondin, Nicole Vasilevsky, Julie A McMurry, Peter N Robinson, Christopher J Mungall, Melissa A Haendel","doi":"10.1186/s13326-023-00283-x","DOIUrl":"10.1186/s13326-023-00283-x","url":null,"abstract":"<p><strong>Background: </strong>Evaluating the impact of environmental exposures on organism health is a key goal of modern biomedicine and is critically important in an age of greater pollution and chemicals in our environment. Environmental health utilizes many different research methods and generates a variety of data types. However, to date, no comprehensive database represents the full spectrum of environmental health data. Due to a lack of interoperability between databases, tools for integrating these resources are needed. In this manuscript we present the Environmental Conditions, Treatments, and Exposures Ontology (ECTO), a species-agnostic ontology focused on exposure events that occur as a result of natural and experimental processes, such as diet, work, or research activities. ECTO is intended for use in harmonizing environmental health data resources to support cross-study integration and inference for mechanism discovery.</p><p><strong>Methods and findings: </strong>ECTO is an ontology designed for describing organismal exposures such as toxicological research, environmental variables, dietary features, and patient-reported data from surveys. ECTO utilizes the base model established within the Exposure Ontology (ExO). ECTO is developed using a combination of manual curation and Dead Simple OWL Design Patterns (DOSDP), and contains over 2700 environmental exposure terms, and incorporates chemical and environmental ontologies. 
ECTO is an Open Biological and Biomedical Ontology (OBO) Foundry ontology that is designed for interoperability, reuse, and axiomatization with other ontologies. ECTO terms have been utilized in axioms within the Mondo Disease Ontology to represent diseases caused or influenced by environmental factors, as well as for survey encoding for the Personalized Environment and Genes Study (PEGS).</p><p><strong>Conclusions: </strong>We constructed ECTO to meet Open Biological and Biomedical Ontology (OBO) Foundry principles to increase translation opportunities between environmental health and other areas of biology. ECTO has a growing community of contributors consisting of toxicologists, public health epidemiologists, and health care providers to provide the necessary expertise for areas that have been identified previously as gaps.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"14 1","pages":"3"},"PeriodicalIF":1.6,"publicationDate":"2023-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9951428/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9257159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Negative Polarity Items (NPIs) with emphatic prosody such as ANY or EVER, and minimizers such as lift a finger or sleep a wink are known to generate particular contextual inferences that are absent in the case of non-emphatic NPIs such as unstressed any or ever. It remains an open question, however, what the exact status of these inferences is and how they come about. In this paper, we analyze these cases as NPIs bearing focus, and examine the interaction between focus semantics and the lexical semantics of NPIs across statements and questions. In the process, we refine and expand the empirical landscape by demonstrating that focused NPIs give rise to a variety of apparently heterogeneous contextual inferences, including domain widening in statements and inferences of negative bias in questions. These inferences are further shown to be modulated in subtle ways depending on the specific clause-type in which the NPI occurs (e.g., polar questions vs. wh-questions) and the type of emphatic NPI involved (e.g., ANY vs. lift a finger). Building on these empirical observations, we propose a unified account of NPIs which posits a single core semantic operator, even, across both focused and unfocused NPIs. What plays a central role in our account is the additive component of even, which we formulate in such a way that it applies uniformly across statements and questions. This additive component of even, intuitively paraphrased as the implication that all salient focus alternatives of the prejacent of the operator must be settled in the doxastic state of the speaker, is selectively activated depending on the presence of focus alternatives, and is shown to be able to derive all the observed contextual inferences stemming from focused NPIs, both in statements and in questions.
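For orientation, a common textbook-style lexical entry for even, with its scalar and additive presuppositional components, can be written as below. This is not the authors' formulation, whose additive component is reworked so as to apply uniformly to statements and questions:

```latex
% Textbook-style entry for "even".
% C = set of contextually salient focus alternatives, p = the prejacent.
\[
[\![\textit{even}]\!]^{w}(C)(p) \text{ is defined only if:}
\]
\[
\text{(i) Scalar: } \forall q \in C\,[\,q \neq p \rightarrow p <_{\text{likely}} q\,]
\]
\[
\text{(ii) Additive: } \exists q \in C\,[\,q \neq p \wedge q(w) = 1\,]
\]
\[
\text{If defined, } [\![\textit{even}]\!]^{w}(C)(p) = p(w).
\]
```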
{"title":"Focused NPIs in Statements and Questions","authors":"Sunwoo Jeong, F. Roelofsen","doi":"10.1093/jos/ffac014","DOIUrl":"https://doi.org/10.1093/jos/ffac014","url":null,"abstract":"\u0000 Negative Polarity Items (NPIs) with emphatic prosody such as ANY or EVER, and minimizers such as lift a finger or sleep a wink are known to generate particular contextual inferences that are absent in the case of non-emphatic NPIs such as unstressed any or ever. It remains an open question, however, what the exact status of these inferences is and how they come about. In this paper, we analyze these cases as NPIs bearing focus, and examine the interaction between focus semantics and the lexical semantics of NPIs across statements and questions. In the process, we refine and expand the empirical landscape by demonstrating that focused NPIs give rise to a variety of apparently heterogeneous contextual inferences, including domain widening in statements and inferences of negative bias in questions. These inferences are further shown to be modulated in subtle ways depending on the specific clause-type in which the NPI occurs (e.g., polar questions vs. wh-questions) and the type of emphatic NPI involved (e.g., ANY vs. lift a finger). Building on these empirical observations, we propose a unified account of NPIs which posits a single core semantic operator, even, across both focused and unfocused NPIs. What plays a central role in our account is the additive component of even, which we formulate in such a way that it applies uniformly across statements and questions. 
This additive component of even, intuitively paraphrased as the implication that all salient focus alternatives of the prejacent of the operator must be settled in the doxastic state of the speaker, is selectively activated depending on the presence of focus alternatives, and is shown to be able to derive all the observed contextual inferences stemming from focused NPIs, both in statements and in questions.","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"34 1","pages":"1-68"},"PeriodicalIF":1.9,"publicationDate":"2023-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81055982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-02. DOI: 10.1186/s13326-022-00281-5
Leonardo Campillos-Llanos
Background: Medical lexicons enable the natural language processing (NLP) of health texts. Lexicons gather terms and concepts from thesauri and ontologies, together with linguistic data for part-of-speech (PoS) tagging, lemmatization or natural language generation. To date, no such resource exists for Spanish.

Construction and content: This article describes MedLexSp, a unified lexicon for medical natural language processing in Spanish. MedLexSp includes terms and inflected word forms with PoS information and Unified Medical Language System (UMLS) semantic types, semantic groups and Concept Unique Identifiers (CUIs). To create it, we used NLP techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms of the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases, 10th revision (ICD-10), the Anatomical Therapeutic Chemical (ATC) Classification, the National Cancer Institute (NCI) Dictionary, the Online Mendelian Inheritance in Man (OMIM) catalogue and OrphaData. Terms related to COVID-19 were assembled with a similarity-based approach using word embeddings trained on a large corpus. MedLexSp includes 100,887 lemmas, 302,543 inflected forms (conjugated verbs and number/gender variants), and 42,958 UMLS CUIs. We report two use cases of MedLexSp: first, applying the lexicon to pre-annotate a corpus of 1,200 texts related to clinical trials; second, PoS tagging and lemmatizing texts about clinical cases. MedLexSp improved PoS tagging and lemmatization scores compared to the default spaCy and Stanza Python libraries.
Conclusions: The lexicon is distributed as a delimiter-separated value file, an XML file following the Lexical Markup Framework, a lemmatizer module for the spaCy and Stanza libraries, and complementary Lexical Record (LR) files. The embeddings and code used to extract COVID-19 terms, together with the spaCy and Stanza lemmatizers enriched with medical terms, are provided in a public repository.
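To give a feel for how such a lexicon is consumed, here is a minimal lookup-based lemmatizer/tagger over a toy MedLexSp-style table. The column layout (form, lemma, PoS, CUI) and the entries are simplifications invented for this sketch; consult the MedLexSp distribution for its actual schema.

```python
# Toy lookup lemmatizer/tagger backed by a MedLexSp-style
# delimiter-separated lexicon. Columns and entries are invented
# for illustration, not MedLexSp's actual schema.
import csv
import io

LEXICON_TSV = (
    "form\tlemma\tpos\tcui\n"
    "cefaleas\tcefalea\tNOUN\tC0018681\n"
    "cefalea\tcefalea\tNOUN\tC0018681\n"
    "diabética\tdiabético\tADJ\tC0241863\n"
)

# Index entries by surface form for O(1) lookup.
lexicon = {
    row["form"]: row
    for row in csv.DictReader(io.StringIO(LEXICON_TSV), delimiter="\t")
}

def annotate(tokens):
    """Look each token up; fall back to the surface form as lemma."""
    out = []
    for tok in tokens:
        entry = lexicon.get(tok.lower())
        if entry:
            out.append((tok, entry["lemma"], entry["pos"], entry["cui"]))
        else:
            out.append((tok, tok.lower(), "X", None))
    return out

print(annotate(["Cefaleas", "frecuentes"]))
```

A pre-annotation pipeline of the kind described in the first use case would run this lookup over tokenized corpus text and emit the matched CUIs as candidate annotations for human review.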
{"title":"MedLexSp - a medical lexicon for Spanish medical natural language processing.","authors":"Leonardo Campillos-Llanos","doi":"10.1186/s13326-022-00281-5","DOIUrl":"10.1186/s13326-022-00281-5","url":null,"abstract":"<p><strong>Background: </strong>Medical lexicons enable the natural language processing (NLP) of health texts. Lexicons gather terms and concepts from thesauri and ontologies, and linguistic data for part-of-speech (PoS) tagging, lemmatization or natural language generation. To date, there is no such type of resource for Spanish.</p><p><strong>Construction and content: </strong>This article describes an unified medical lexicon for Medical Natural Language Processing in Spanish. MedLexSp includes terms and inflected word forms with PoS information and Unified Medical Language System[Formula: see text] (UMLS) semantic types, groups and Concept Unique Identifiers (CUIs). To create it, we used NLP techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms from the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases vs. 10, the Anatomical Therapeutic Chemical Classification, the National Cancer Institute (NCI) Dictionary, the Online Mendelian Inheritance in Man (OMIM) and OrphaData. Terms related to COVID-19 were assembled by applying a similarity-based approach with word embeddings trained on a large corpus. MedLexSp includes 100 887 lemmas, 302 543 inflected forms (conjugated verbs, and number/gender variants), and 42 958 UMLS CUIs. We report two use cases of MedLexSp. First, applying the lexicon to pre-annotate a corpus of 1200 texts related to clinical trials. Second, PoS tagging and lemmatizing texts about clinical cases. 
MedLexSp improved the scores for PoS tagging and lemmatization compared to the default Spacy and Stanza python libraries.</p><p><strong>Conclusions: </strong>The lexicon is distributed in a delimiter-separated value file; an XML file with the Lexical Markup Framework; a lemmatizer module for the Spacy and Stanza libraries; and complementary Lexical Record (LR) files. The embeddings and code to extract COVID-19 terms, and the Spacy and Stanza lemmatizers enriched with medical terms are provided in a public repository.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"14 1","pages":"2"},"PeriodicalIF":1.9,"publicationDate":"2023-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9892682/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9619937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-01-31. DOI: 10.1186/s13326-023-00282-y
Antonio Jose Jimeno Yepes, Karin Verspoor
Background: Information pertaining to the mechanisms, management and treatment of disease-causing pathogens, including viruses and bacteria, is readily available from research publications indexed in MEDLINE. However, identifying the literature that specifically characterises these pathogens and their properties on the basis of experimental research, which is important for understanding the molecular basis of the diseases these agents cause, requires sifting through a large number of articles to exclude incidental mentions of pathogens, or references to pathogens in non-experimental contexts such as public health.

Objective: In this work, we lay the foundations for the development of automatic methods for characterising mentions of pathogens in the scientific literature, focusing on the task of identifying research that involves the experimental study of a pathogen. No manually annotated pathogen corpora are available for this purpose, yet such resources are necessary to support the development of machine learning-based models. We therefore aim to fill this gap, producing a large data set automatically from MEDLINE under some simplifying assumptions about the task definition, and using it to explore automatic methods that detect experimentally studied pathogen mentions in research publications.

Methods: We developed READBiomed-Pathogens, a pathogen mention characterisation literature data set built automatically using NCBI resources, which we make available. Resources such as the NCBI Taxonomy, MeSH and GenBank can be used effectively to identify relevant literature about experimentally researched pathogens; more specifically, MeSH links to MEDLINE citations, including titles and abstracts, that involve experimentally researched pathogens. We experiment with several machine learning-based natural language processing (NLP) algorithms that use this data set as training data to model the task of detecting papers that specifically describe the experimental study of a pathogen.

Results: We show that READBiomed-Pathogens can be used to explore NLP configurations for experimental pathogen mention characterisation. The data set includes citations related to organisms including bacteria and viruses, as well as a small number of toxins and other disease-causing agents.

Conclusions: We studied the characterisation of experimentally studied pathogens in the scientific literature, developing several NLP methods supported by an automatically developed data set. As a core contribution, we present a methodology for automatically constructing a data set for pathogen identification using existing biomedical resources. The data set and the annotation code are made publicly available. Performance of the pathogen mention identification and characterisation…
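The classification task described here, deciding whether a paper describes the experimental study of a pathogen, can be sketched with a toy bag-of-words Naive Bayes classifier. The four-line "corpus" and its labels are invented for illustration; the actual READBiomed-Pathogens data set is derived from MEDLINE and MeSH, and the paper's models are not reproduced here.

```python
# Toy bag-of-words Naive Bayes over paper titles, illustrating the
# kind of classifier one could train on such a data set (label 1 =
# experimental study of a pathogen, 0 = other context). The tiny
# corpus is invented for illustration.
import math
from collections import Counter

TRAIN = [
    ("in vitro characterisation of a novel influenza strain", 1),
    ("genome sequencing of salmonella isolates from patients", 1),
    ("public health response to an influenza outbreak", 0),
    ("survey of hand hygiene practices in hospitals", 0),
]

def tokenize(text):
    return text.lower().split()

# Per-class token counts and class priors.
counts = {0: Counter(), 1: Counter()}
class_n = Counter()
for text, y in TRAIN:
    counts[y].update(tokenize(text))
    class_n[y] += 1
vocab = set(counts[0]) | set(counts[1])

def score(text, y):
    """Log posterior (up to a constant) with add-one smoothing."""
    total = sum(counts[y].values())
    logp = math.log(class_n[y] / len(TRAIN))
    for tok in tokenize(text):
        logp += math.log((counts[y][tok] + 1) / (total + len(vocab)))
    return logp

def predict(text):
    return max((0, 1), key=lambda y: score(text, y))
```

With realistic training data one would of course reach for an established library, but the scoring logic is the same: a prior per class plus smoothed per-token likelihoods.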
{"title":"Classifying literature mentions of biological pathogens as experimentally studied using natural language processing.","authors":"Antonio Jose Jimeno Yepes, Karin Verspoor","doi":"10.1186/s13326-023-00282-y","DOIUrl":"10.1186/s13326-023-00282-y","url":null,"abstract":"<p><strong>Background: </strong>Information pertaining to mechanisms, management and treatment of disease-causing pathogens including viruses and bacteria is readily available from research publications indexed in MEDLINE. However, identifying the literature that specifically characterises these pathogens and their properties based on experimental research, important for understanding of the molecular basis of diseases caused by these agents, requires sifting through a large number of articles to exclude incidental mentions of the pathogens, or references to pathogens in other non-experimental contexts such as public health.</p><p><strong>Objective: </strong>In this work, we lay the foundations for the development of automatic methods for characterising mentions of pathogens in scientific literature, focusing on the task of identifying research that involves the experimental study of a pathogen in an experimental context. There are no manually annotated pathogen corpora available for this purpose, while such resources are necessary to support the development of machine learning-based models. We therefore aim to fill this gap, producing a large data set automatically from MEDLINE under some simplifying assumptions for the task definition, and using it to explore automatic methods that specifically support the detection of experimentally studied pathogen mentions in research publications.</p><p><strong>Methods: </strong>We developed a pathogen mention characterisation literature data set -READBiomed-Pathogens- automatically using NCBI resources, which we make available. 
Resources such as the NCBI Taxonomy, MeSH and GenBank can be used effectively to identify relevant literature about experimentally researched pathogens, more specifically using MeSH to link to MEDLINE citations including titles and abstracts with experimentally researched pathogens. We experiment with several machine learning-based natural language processing (NLP) algorithms leveraging this data set as training data, to model the task of detecting papers that specifically describe experimental study of a pathogen.</p><p><strong>Results: </strong>We show that our data set READBiomed-Pathogens can be used to explore natural language processing configurations for experimental pathogen mention characterisation. READBiomed-Pathogens includes citations related to organisms including bacteria, viruses, and a small number of toxins and other disease-causing agents.</p><p><strong>Conclusions: </strong>We studied the characterisation of experimentally studied pathogens in scientific literature, developing several natural language processing methods supported by an automatically developed data set. As a core contribution of the work, we presented a methodology to automatically construct a data set for pathogen identification using existing biomedical resources. The data set and the annotation code are made publicly available. 
Performance of the pathogen mention identification and characterisa","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"14 1","pages":"1"},"PeriodicalIF":1.9,"publicationDate":"2023-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9889128/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9243626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-10-27. DOI: 10.1186/s13326-022-00280-6
Lisa Kühnel, Juliane Fluck
Background: Intense research has been done in the area of biomedical natural language processing. Since the breakthrough of transfer learning-based methods, BERT models have been used in a variety of biomedical and clinical applications. On the available data sets, these models show excellent results, partly exceeding the inter-annotator agreement. However, biomedical named entity recognition applied to COVID-19 preprints shows a performance drop compared to the results on the original test data. This raises the question of how well trained models are able to predict on completely new data, i.e. to generalize.

Results: Using disease named entity recognition as an example, we investigate the robustness of different machine learning-based methods, among them transfer learning, and show that current state-of-the-art methods work well for a given training set and the corresponding test set, but suffer a significant loss of generalization when applied to new data.
Conclusions: We argue that there is a need for larger annotated data sets for training and testing. Therefore, we foresee the curation of further data sets and, moreover, the investigation of continual learning processes for machine learning-based models.
{"title":"We are not ready yet: limitations of state-of-the-art disease named entity recognizers.","authors":"Lisa Kühnel, Juliane Fluck","doi":"10.1186/s13326-022-00280-6","DOIUrl":"https://doi.org/10.1186/s13326-022-00280-6","url":null,"abstract":"<p><strong>Background: </strong>Intense research has been done in the area of biomedical natural language processing. Since the breakthrough of transfer learning-based methods, BERT models have been used in a variety of biomedical and clinical applications. For the available data sets, these models show excellent results, in some cases exceeding the inter-annotator agreements. However, biomedical named entity recognition applied to COVID-19 preprints shows a performance drop compared to the results on test data. This raises the question of how well trained models are able to predict on completely new data, i.e. how well they generalize.</p><p><strong>Results: </strong>Based on the example of disease named entity recognition, we investigate the robustness of different machine learning-based methods, including transfer learning, and show that current state-of-the-art methods work well for a given training set and the corresponding test set but show a significant lack of generalization when applied to new data.</p><p><strong>Conclusions: </strong>We argue that there is a need for larger annotated data sets for training and testing. 
Therefore, we foresee the curation of further data sets and, moreover, the investigation of continual learning processes for machine learning-based models.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":" ","pages":"26"},"PeriodicalIF":1.9,"publicationDate":"2022-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9612606/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40429097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
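The generalization gap reported above is typically measured with strict span-level precision, recall and F1, where a predicted entity counts only if its boundaries and type exactly match a gold annotation. A minimal sketch of that metric (the example spans are invented, not from the paper's data):

```python
def span_f1(gold, pred):
    """Strict span-level precision/recall/F1: a predicted entity counts
    only if its (start, end, type) triple exactly matches a gold span."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented example: in-domain vs. out-of-domain predictions for one sentence,
# as (start_offset, end_offset, entity_type) triples
gold = [(0, 8, "Disease"), (24, 32, "Disease")]
in_domain_pred = [(0, 8, "Disease"), (24, 32, "Disease")]
ood_pred = [(0, 8, "Disease"), (40, 47, "Disease")]  # one miss, one spurious

print(span_f1(gold, in_domain_pred))
print(span_f1(gold, ood_pred))
```

Comparing this score between an in-domain test set and a new corpus such as COVID-19 preprints is what exposes the lack of generalization the authors describe.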
Pub Date: 2022-10-21 DOI: 10.1186/s13326-022-00279-z
Yongqun He, Hong Yu, Anthony Huffman, Asiyah Yu Lin, Darren A Natale, John Beverley, Ling Zheng, Yehoshua Perl, Zhigang Wang, Yingtong Liu, Edison Ong, Yang Wang, Philip Huang, Long Tran, Jinyang Du, Zalan Shah, Easheta Shah, Roshan Desai, Hsin-Hui Huang, Yujia Tian, Eric Merrell, William D Duncan, Sivaram Arabandi, Lynn M Schriml, Jie Zheng, Anna Maria Masci, Liwei Wang, Hongfang Liu, Fatima Zohra Smaili, Robert Hoehndorf, Zoë May Pendlington, Paola Roncaglia, Xianwei Ye, Jiangan Xie, Yi-Wei Tang, Xiaolin Yang, Suyuan Peng, Luxia Zhang, Luonan Chen, Junguk Hur, Gilbert S Omenn, Brian Athey, Barry Smith
Background: The current COVID-19 pandemic and the previous SARS/MERS outbreaks of 2003 and 2012 have resulted in a series of major global public health crises. We argue that in the interest of developing effective and safe vaccines and drugs, and of better understanding coronaviruses and associated disease mechanisms, it is necessary to integrate the large and exponentially growing body of heterogeneous coronavirus data. Ontologies play an important role in standard-based knowledge and data representation, integration, sharing, and analysis. Accordingly, we initiated the development of the community-based Coronavirus Infectious Disease Ontology (CIDO) in early 2020.
Results: As an Open Biomedical Ontology (OBO) library ontology, CIDO is open source and interoperable with other existing OBO ontologies. CIDO is aligned with the Basic Formal Ontology and Viral Infectious Disease Ontology. CIDO has imported terms from over 30 OBO ontologies. For example, CIDO imports all SARS-CoV-2 protein terms from the Protein Ontology, COVID-19-related phenotype terms from the Human Phenotype Ontology, and over 100 COVID-19 terms for vaccines (both authorized and in clinical trial) from the Vaccine Ontology. CIDO systematically represents variants of SARS-CoV-2 viruses and over 300 amino acid substitutions therein, along with over 300 diagnostic kits and methods. CIDO also describes hundreds of host-coronavirus protein-protein interactions (PPIs) and the drugs that target proteins in these PPIs. CIDO has been used to model COVID-19 related phenomena in areas such as epidemiology. The scope of CIDO was evaluated by visual analysis supported by a summarization network method. CIDO has been used in various applications such as term standardization, inference, natural language processing (NLP) and clinical data integration. We have applied the amino acid variant knowledge present in CIDO to analyze differences between SARS-CoV-2 Delta and Omicron variants. CIDO's integrative host-coronavirus PPIs and drug-target knowledge has also been used to support drug repurposing for COVID-19 treatment.
Conclusion: CIDO represents entities and relations in the domain of coronavirus diseases with a special focus on COVID-19. It supports shared knowledge representation, data and metadata standardization and integration, and has been used in a range of applications.
{"title":"A comprehensive update on CIDO: the community-based coronavirus infectious disease ontology.","authors":"Yongqun He, Hong Yu, Anthony Huffman, Asiyah Yu Lin, Darren A Natale, John Beverley, Ling Zheng, Yehoshua Perl, Zhigang Wang, Yingtong Liu, Edison Ong, Yang Wang, Philip Huang, Long Tran, Jinyang Du, Zalan Shah, Easheta Shah, Roshan Desai, Hsin-Hui Huang, Yujia Tian, Eric Merrell, William D Duncan, Sivaram Arabandi, Lynn M Schriml, Jie Zheng, Anna Maria Masci, Liwei Wang, Hongfang Liu, Fatima Zohra Smaili, Robert Hoehndorf, Zoë May Pendlington, Paola Roncaglia, Xianwei Ye, Jiangan Xie, Yi-Wei Tang, Xiaolin Yang, Suyuan Peng, Luxia Zhang, Luonan Chen, Junguk Hur, Gilbert S Omenn, Brian Athey, Barry Smith","doi":"10.1186/s13326-022-00279-z","DOIUrl":"10.1186/s13326-022-00279-z","url":null,"abstract":"<p><strong>Background: </strong>The current COVID-19 pandemic and the previous SARS/MERS outbreaks of 2003 and 2012 have resulted in a series of major global public health crises. We argue that in the interest of developing effective and safe vaccines and drugs, and of better understanding coronaviruses and associated disease mechanisms, it is necessary to integrate the large and exponentially growing body of heterogeneous coronavirus data. Ontologies play an important role in standard-based knowledge and data representation, integration, sharing, and analysis. Accordingly, we initiated the development of the community-based Coronavirus Infectious Disease Ontology (CIDO) in early 2020.</p><p><strong>Results: </strong>As an Open Biomedical Ontology (OBO) library ontology, CIDO is open source and interoperable with other existing OBO ontologies. CIDO is aligned with the Basic Formal Ontology and Viral Infectious Disease Ontology. CIDO has imported terms from over 30 OBO ontologies. 
For example, CIDO imports all SARS-CoV-2 protein terms from the Protein Ontology, COVID-19-related phenotype terms from the Human Phenotype Ontology, and over 100 COVID-19 terms for vaccines (both authorized and in clinical trial) from the Vaccine Ontology. CIDO systematically represents variants of SARS-CoV-2 viruses and over 300 amino acid substitutions therein, along with over 300 diagnostic kits and methods. CIDO also describes hundreds of host-coronavirus protein-protein interactions (PPIs) and the drugs that target proteins in these PPIs. CIDO has been used to model COVID-19 related phenomena in areas such as epidemiology. The scope of CIDO was evaluated by visual analysis supported by a summarization network method. CIDO has been used in various applications such as term standardization, inference, natural language processing (NLP) and clinical data integration. We have applied the amino acid variant knowledge present in CIDO to analyze differences between SARS-CoV-2 Delta and Omicron variants. CIDO's integrative host-coronavirus PPIs and drug-target knowledge has also been used to support drug repurposing for COVID-19 treatment.</p><p><strong>Conclusion: </strong>CIDO represents entities and relations in the domain of coronavirus diseases with a special focus on COVID-19. 
It supports shared knowledge representation, data and metadata standardization and integration, and has been used in a range of applications.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"13 1","pages":"25"},"PeriodicalIF":2.0,"publicationDate":"2022-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9585694/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9587760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
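Much of the inference CIDO supports (term standardization, classification-based querying) rests on transitive closure over is-a edges in an OBO-style hierarchy. A minimal sketch of that ancestor computation over a toy hierarchy — the term labels below are illustrative, not actual CIDO identifiers:

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following is-a edges upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Invented toy is-a edges loosely modelled on an OBO-style disease hierarchy
parents = {
    "COVID-19": ["coronavirus infectious disease"],
    "coronavirus infectious disease": ["viral infectious disease"],
    "viral infectious disease": ["infectious disease"],
    "infectious disease": ["disease"],
}
print(sorted(ancestors("COVID-19", parents)))
```

A query for all viral infectious diseases, for instance, matches any term whose ancestor set contains "viral infectious disease", which is how ontology-backed retrieval generalizes beyond exact term matches.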
Pub Date: 2022-10-18 DOI: 10.1186/s13326-022-00278-0
Benedikt Fh Becker, Jan A Kors, Erik M van Mulligen, Miriam Cjm Sturkenboom
Background: Vaccine information in European electronic health record (EHR) databases is represented using various clinical and database-specific coding systems and drug vocabularies. The lack of harmonization constitutes a challenge in reusing EHR data in collaborative benefit-risk studies about vaccines.
Methods: We designed an ontology of the properties that are commonly used in vaccine descriptions, called the Ontology of Vaccine Descriptions (VaccO), with a dictionary for the analysis of multilingual vaccine descriptions. We implemented five algorithms for the alignment of vaccine coding systems, i.e., the identification of corresponding codes from different coding systems, based on an analysis of the code descriptors. The algorithms were evaluated by comparing their results with manually created alignments in two reference sets including clinical and database-specific coding systems with multilingual code descriptors.
Results: The best-performing algorithm represented code descriptors as logical statements about entities in the VaccO ontology and used an ontology reasoner to infer common properties and identify corresponding vaccine codes. The evaluation demonstrated excellent performance of the approach (F-scores 0.91 and 0.96).
Conclusion: The VaccO ontology allows the identification, representation, and comparison of heterogeneous descriptions of vaccines. The automatic alignment of vaccine coding systems can accelerate the readiness of EHR databases in collaborative vaccine studies.
{"title":"Alignment of vaccine codes using an ontology of vaccine descriptions.","authors":"Benedikt Fh Becker, Jan A Kors, Erik M van Mulligen, Miriam Cjm Sturkenboom","doi":"10.1186/s13326-022-00278-0","DOIUrl":"https://doi.org/10.1186/s13326-022-00278-0","url":null,"abstract":"<p><strong>Background: </strong>Vaccine information in European electronic health record (EHR) databases is represented using various clinical and database-specific coding systems and drug vocabularies. The lack of harmonization constitutes a challenge in reusing EHR data in collaborative benefit-risk studies about vaccines.</p><p><strong>Methods: </strong>We designed an ontology of the properties that are commonly used in vaccine descriptions, called the Ontology of Vaccine Descriptions (VaccO), with a dictionary for the analysis of multilingual vaccine descriptions. We implemented five algorithms for the alignment of vaccine coding systems, i.e., the identification of corresponding codes from different coding systems, based on an analysis of the code descriptors. The algorithms were evaluated by comparing their results with manually created alignments in two reference sets including clinical and database-specific coding systems with multilingual code descriptors.</p><p><strong>Results: </strong>The best-performing algorithm represented code descriptors as logical statements about entities in the VaccO ontology and used an ontology reasoner to infer common properties and identify corresponding vaccine codes. The evaluation demonstrated excellent performance of the approach (F-scores 0.91 and 0.96).</p><p><strong>Conclusion: </strong>The VaccO ontology allows the identification, representation, and comparison of heterogeneous descriptions of vaccines. 
The automatic alignment of vaccine coding systems can accelerate the readiness of EHR databases in collaborative vaccine studies.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":" ","pages":"24"},"PeriodicalIF":1.9,"publicationDate":"2022-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9580193/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40339107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
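The alignment idea above — mapping multilingual descriptors to ontology properties and pairing codes whose inferred properties coincide — can be sketched with simple set comparison in place of a full ontology reasoner. The dictionary entries, codes, and descriptors below are invented for illustration and are not VaccO content:

```python
# Toy multilingual keyword -> vaccine-property dictionary (invented, not VaccO)
DICTIONARY = {
    "measles": "antigen:measles", "masern": "antigen:measles",
    "mumps": "antigen:mumps", "rubella": "antigen:rubella",
    "live": "attr:live-attenuated", "attenuated": "attr:live-attenuated",
}

def properties(descriptor):
    """Map a free-text code descriptor to a set of vaccine properties."""
    return {DICTIONARY[w] for w in descriptor.lower().split() if w in DICTIONARY}

def align(system_a, system_b):
    """Pair codes whose inferred property sets are equal and non-empty."""
    pairs = []
    for code_a, desc_a in system_a.items():
        props_a = properties(desc_a)
        for code_b, desc_b in system_b.items():
            if props_a and props_a == properties(desc_b):
                pairs.append((code_a, code_b))
    return pairs

# Invented codes: an English-descriptor system and a German-descriptor system
system_a = {"A1": "measles mumps rubella vaccine",
            "A2": "live attenuated measles vaccine"}
system_b = {"B7": "Masern Mumps Rubella Impfstoff",
            "B9": "inactivated polio vaccine"}
print(align(system_a, system_b))
```

The published approach goes further by expressing descriptors as logical statements and letting a reasoner infer properties (e.g. subsumption between antigen classes), but the core matching step reduces to comparing the inferred property sets as done here.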