Negative Polarity Items (NPIs) with emphatic prosody such as ANY or EVER, and minimizers such as lift a finger or sleep a wink are known to generate particular contextual inferences that are absent in the case of non-emphatic NPIs such as unstressed any or ever. It remains an open question, however, what the exact status of these inferences is and how they come about. In this paper, we analyze these cases as NPIs bearing focus, and examine the interaction between focus semantics and the lexical semantics of NPIs across statements and questions. In the process, we refine and expand the empirical landscape by demonstrating that focused NPIs give rise to a variety of apparently heterogeneous contextual inferences, including domain widening in statements and inferences of negative bias in questions. These inferences are further shown to be modulated in subtle ways depending on the specific clause-type in which the NPI occurs (e.g., polar questions vs. wh-questions) and the type of emphatic NPI involved (e.g., ANY vs. lift a finger). Building on these empirical observations, we propose a unified account of NPIs which posits a single core semantic operator, even, across both focused and unfocused NPIs. What plays a central role in our account is the additive component of even, which we formulate in such a way that it applies uniformly across statements and questions. This additive component of even, intuitively paraphrased as the implication that all salient focus alternatives of the prejacent of the operator must be settled in the doxastic state of the speaker, is selectively activated depending on the presence of focus alternatives, and is shown to be able to derive all the observed contextual inferences stemming from focused NPIs, both in statements and in questions.
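One way to render the additive component paraphrased above in a single formula (our notation, not necessarily the paper's): for a prejacent p uttered by speaker S,

```latex
\mathrm{ADD}(p) \;\rightsquigarrow\; \forall q \in \mathrm{Alt}_F(p)\,:\; \mathrm{settled}(\mathrm{Dox}_S, q)
```

where Alt_F(p) is the set of salient focus alternatives of p and settled(Dox_S, q) holds just in case S's doxastic state decides q one way or the other. Note that the formulation makes no reference to clause type, which is what allows it to apply uniformly to statements and questions.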
"Focused NPIs in Statements and Questions." Sunwoo Jeong, F. Roelofsen. Pub Date: 2023-02-16 | DOI: 10.1093/jos/ffac014
Pub Date: 2023-02-02 | DOI: 10.1186/s13326-022-00281-5
Leonardo Campillos-Llanos
Background: Medical lexicons enable the natural language processing (NLP) of health texts. Lexicons gather terms and concepts from thesauri and ontologies, together with linguistic data for part-of-speech (PoS) tagging, lemmatization or natural language generation. To date, no such resource has existed for Spanish.
Construction and content: This article describes a unified medical lexicon for medical natural language processing in Spanish. MedLexSp includes terms and inflected word forms with PoS information and Unified Medical Language System® (UMLS) semantic types, semantic groups and Concept Unique Identifiers (CUIs). To create it, we used NLP techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms of the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases, version 10, the Anatomical Therapeutic Chemical Classification, the National Cancer Institute (NCI) Dictionary, Online Mendelian Inheritance in Man (OMIM) and OrphaData. Terms related to COVID-19 were assembled by applying a similarity-based approach with word embeddings trained on a large corpus. MedLexSp includes 100 887 lemmas, 302 543 inflected forms (conjugated verbs and number/gender variants), and 42 958 UMLS CUIs. We report two use cases of MedLexSp: first, applying the lexicon to pre-annotate a corpus of 1200 texts related to clinical trials; second, PoS tagging and lemmatizing texts about clinical cases. MedLexSp improved PoS tagging and lemmatization scores compared to the default spaCy and Stanza Python libraries.
Conclusions: The lexicon is distributed as a delimiter-separated value file; an XML file conforming to the Lexical Markup Framework; a lemmatizer module for the spaCy and Stanza libraries; and complementary Lexical Record (LR) files. The embeddings and code used to extract COVID-19 terms, and the spaCy and Stanza lemmatizers enriched with medical terms, are provided in a public repository.
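The core lookup a lexicon like this enables can be sketched in a few lines. The column layout and sample rows below are assumptions for illustration, not the actual MedLexSp file format:

```python
import csv
import io

# Hypothetical delimiter-separated lexicon in the spirit of MedLexSp:
# each row maps an inflected Spanish form to its lemma, PoS tag and UMLS CUI.
SAMPLE = """form\tlemma\tpos\tcui
médicos\tmédico\tNOUN\tC0031831
tratamientos\ttratamiento\tNOUN\tC0087111
tratados\ttratar\tVERB\tC0087111
"""

def load_lexicon(text):
    """Index rows by surface form for constant-time lookup during tagging."""
    lexicon = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        lexicon[row["form"]] = (row["lemma"], row["pos"], row["cui"])
    return lexicon

def lemmatize(token, lexicon):
    """Return (lemma, pos, cui) for a known form; fall back to the token itself."""
    return lexicon.get(token, (token, None, None))

lex = load_lexicon(SAMPLE)
print(lemmatize("tratamientos", lex))  # ('tratamiento', 'NOUN', 'C0087111')
```

A lookup table of this shape is what lets a lexicon-backed lemmatizer override the defaults of general-purpose pipelines such as spaCy or Stanza for in-domain vocabulary.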
"MedLexSp - a medical lexicon for Spanish medical natural language processing." Journal of Biomedical Semantics 14(1): 2. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9892682/pdf/
Pub Date: 2023-01-31 | DOI: 10.1186/s13326-023-00282-y
Antonio Jose Jimeno Yepes, Karin Verspoor
Background: Information pertaining to mechanisms, management and treatment of disease-causing pathogens, including viruses and bacteria, is readily available from research publications indexed in MEDLINE. However, identifying the literature that specifically characterises these pathogens and their properties on the basis of experimental research, which is important for understanding the molecular basis of the diseases they cause, requires sifting through a large number of articles to exclude incidental mentions of pathogens, or references to pathogens in non-experimental contexts such as public health.
Objective: In this work, we lay the foundations for the development of automatic methods for characterising mentions of pathogens in the scientific literature, focusing on the task of identifying research that involves the experimental study of a pathogen. No manually annotated pathogen corpora are available for this purpose, yet such resources are necessary to support the development of machine learning-based models. We therefore aim to fill this gap by producing a large data set automatically from MEDLINE, under some simplifying assumptions for the task definition, and using it to explore automatic methods that specifically support the detection of experimentally studied pathogen mentions in research publications.
Methods: We developed a pathogen mention characterisation literature data set, READBiomed-Pathogens, automatically using NCBI resources, which we make available. Resources such as the NCBI Taxonomy, MeSH and GenBank can be used effectively to identify relevant literature about experimentally researched pathogens; more specifically, MeSH can be used to link to MEDLINE citations, including titles and abstracts, involving experimentally researched pathogens. We experiment with several machine learning-based natural language processing (NLP) algorithms that leverage this data set as training data, to model the task of detecting papers that specifically describe the experimental study of a pathogen.
Results: We show that our data set, READBiomed-Pathogens, can be used to explore natural language processing configurations for experimental pathogen mention characterisation. READBiomed-Pathogens includes citations related to organisms including bacteria and viruses, as well as a small number of toxins and other disease-causing agents.
Conclusions: We studied the characterisation of experimentally studied pathogens in the scientific literature, developing several natural language processing methods supported by an automatically constructed data set. As a core contribution of this work, we presented a methodology to automatically construct a data set for pathogen identification using existing biomedical resources. The data set and the annotation code are made publicly available. Performance of the pathogen mention identification and characterisa
"Classifying literature mentions of biological pathogens as experimentally studied using natural language processing." Journal of Biomedical Semantics 14(1): 1. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9889128/pdf/
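The automatic data set construction idea above, deciding from a citation's MeSH indexing whether a pathogen is experimentally studied rather than incidentally mentioned, can be illustrated with a toy sketch. The descriptor names, the qualifier set and the "major topic" convention below are illustrative assumptions, not the paper's exact heuristics:

```python
# Qualifiers that, for this toy sketch, are taken to signal experimental study.
EXPERIMENTAL_QUALIFIERS = {"genetics", "pathogenicity", "drug effects"}

def is_experimentally_studied(citation, pathogen):
    """Label a citation positively if the pathogen appears as a major MeSH
    topic with an experiment-oriented qualifier attached."""
    for descriptor, qualifier, major in citation["mesh"]:
        if descriptor == pathogen and major and qualifier in EXPERIMENTAL_QUALIFIERS:
            return True
    return False

# A made-up citation record: (descriptor, qualifier, is_major_topic) triples.
citation = {
    "pmid": "000000",
    "mesh": [
        ("SARS-CoV-2", "pathogenicity", True),
        ("Humans", None, False),
    ],
}
print(is_experimentally_studied(citation, "SARS-CoV-2"))  # True
```

Weak labels produced this way can then serve as training data for the NLP classifiers the abstract describes.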
Pub Date: 2022-10-21 | DOI: 10.1186/s13326-022-00279-z
Yongqun He, Hong Yu, Anthony Huffman, Asiyah Yu Lin, Darren A Natale, John Beverley, Ling Zheng, Yehoshua Perl, Zhigang Wang, Yingtong Liu, Edison Ong, Yang Wang, Philip Huang, Long Tran, Jinyang Du, Zalan Shah, Easheta Shah, Roshan Desai, Hsin-Hui Huang, Yujia Tian, Eric Merrell, William D Duncan, Sivaram Arabandi, Lynn M Schriml, Jie Zheng, Anna Maria Masci, Liwei Wang, Hongfang Liu, Fatima Zohra Smaili, Robert Hoehndorf, Zoë May Pendlington, Paola Roncaglia, Xianwei Ye, Jiangan Xie, Yi-Wei Tang, Xiaolin Yang, Suyuan Peng, Luxia Zhang, Luonan Chen, Junguk Hur, Gilbert S Omenn, Brian Athey, Barry Smith
Background: The current COVID-19 pandemic and the previous SARS and MERS outbreaks of 2003 and 2012 have resulted in a series of major global public health crises. We argue that, in the interest of developing effective and safe vaccines and drugs and of better understanding coronaviruses and associated disease mechanisms, it is necessary to integrate the large and exponentially growing body of heterogeneous coronavirus data. Ontologies play an important role in standards-based knowledge and data representation, integration, sharing, and analysis. Accordingly, we initiated the development of the community-based Coronavirus Infectious Disease Ontology (CIDO) in early 2020.
Results: As an Open Biomedical Ontology (OBO) library ontology, CIDO is open source and interoperable with other existing OBO ontologies. CIDO is aligned with the Basic Formal Ontology and Viral Infectious Disease Ontology. CIDO has imported terms from over 30 OBO ontologies. For example, CIDO imports all SARS-CoV-2 protein terms from the Protein Ontology, COVID-19-related phenotype terms from the Human Phenotype Ontology, and over 100 COVID-19 terms for vaccines (both authorized and in clinical trial) from the Vaccine Ontology. CIDO systematically represents variants of SARS-CoV-2 viruses and over 300 amino acid substitutions therein, along with over 300 diagnostic kits and methods. CIDO also describes hundreds of host-coronavirus protein-protein interactions (PPIs) and the drugs that target proteins in these PPIs. CIDO has been used to model COVID-19 related phenomena in areas such as epidemiology. The scope of CIDO was evaluated by visual analysis supported by a summarization network method. CIDO has been used in various applications such as term standardization, inference, natural language processing (NLP) and clinical data integration. We have applied the amino acid variant knowledge present in CIDO to analyze differences between SARS-CoV-2 Delta and Omicron variants. CIDO's integrative host-coronavirus PPIs and drug-target knowledge has also been used to support drug repurposing for COVID-19 treatment.
Conclusion: CIDO represents entities and relations in the domain of coronavirus diseases with a special focus on COVID-19. It supports shared knowledge representation, data and metadata standardization and integration, and has been used in a range of applications.
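The kind of amino acid variant knowledge mentioned above lends itself to simple structured comparison. The sketch below parses substitution labels such as "D614G" into (reference, position, alternate) triples and intersects two variant profiles; the substitution lists are small illustrative examples, not CIDO content:

```python
import re

# Substitution labels follow the "<ref><position><alt>" convention, e.g. D614G.
SUB = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def parse(sub):
    """Turn 'D614G' into ('D', 614, 'G') for set-based comparison."""
    ref, pos, alt = SUB.match(sub).groups()
    return (ref, int(pos), alt)

# Example spike-protein substitution profiles (illustrative subsets only).
delta = {parse(s) for s in ["T19R", "L452R", "T478K", "D614G", "P681R"]}
omicron = {parse(s) for s in ["T478K", "D614G", "P681H", "N501Y"]}

shared = delta & omicron
print(sorted(shared))  # substitutions common to both example profiles
```

An ontology like CIDO adds value over such ad hoc sets by giving each substitution a stable identifier and linking it to proteins, variants and drug targets.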
"A comprehensive update on CIDO: the community-based coronavirus infectious disease ontology." Journal of Biomedical Semantics 13(1): 25. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9585694/pdf/
Pub Date: 2022-09-08 | DOI: 10.1186/s13326-022-00277-1
John Grimes, Piotr Szul, Alejandro Metke-Jimenez, Michael Lawley, Kylynn Loi
Background: Health data analytics is facing rapid change, driven by the accelerating digitization of the health sector and the changing landscape of health data and clinical terminology standards. Our research has identified a need for improved tooling to support analytics users in the task of analyzing Fast Healthcare Interoperability Resources (FHIR®) data and associated clinical terminology.
Results: A server implementation was developed, featuring a FHIR API with new operations designed to support exploratory data analysis (EDA), advanced patient cohort selection and data preparation tasks. Integration with a FHIR Terminology Service is also supported, allowing users to incorporate knowledge from rich terminologies such as SNOMED CT within their queries. A prototype user interface for EDA was developed, along with visualizations in support of a health data analysis project.
Conclusions: Experience with applying this technology within research projects and towards the development of analytics-enabled applications provides a preliminary indication that the FHIR Analytics API pattern implemented by Pathling is a valuable abstraction for data scientists and software developers within the health care domain. Pathling contributes towards the value proposition for the use of FHIR within health data analytics, and assists with the use of complex clinical terminologies in that context.
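For context on the cohort-selection operations discussed above, even a plain FHIR search can express simple cohorts; the sketch below builds such a query using standard FHIR reverse chaining (`_has`). The base URL is a placeholder, and this is the generic FHIR search API rather than Pathling's extended operations:

```python
from urllib.parse import urlencode

BASE = "https://fhir.example.org"  # placeholder FHIR server, not a real endpoint

def cohort_query(snomed_code):
    """Select patients that have a Condition coded with the given SNOMED CT
    concept, via FHIR reverse chaining from Patient into Condition."""
    params = {"_has:Condition:patient:code": f"http://snomed.info/sct|{snomed_code}"}
    return f"{BASE}/Patient?{urlencode(params)}"

# 38341003 is the SNOMED CT concept for hypertensive disorder.
print(cohort_query("38341003"))
```

Server-side analytics layers add value beyond this baseline by handling terminology expansion (e.g. subsumption over SNOMED CT hierarchies) and data preparation, which plain search parameters cannot express.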
"Pathling: analytics on FHIR." Journal of Biomedical Semantics 13(1): 23. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9455941/pdf/
Annemarie van Dooren, Anouk Dieuleveut, Ailís Cournane, V. Hacquard
This paper investigates how children figure out that modals like must can be used to express both epistemic and “root” (i.e. non-epistemic) flavors. The existing acquisition literature shows that children produce modals with epistemic meanings up to a year later than with root meanings. We conducted a corpus study to examine how modality is expressed in speech to and by young children, to investigate the ways in which the linguistic input children hear may help or hinder them in uncovering the flavor flexibility of modals. Our results show that the way parents use modals may obscure the fact that they can express epistemic flavors: modals are very rarely used epistemically. Yet, children eventually figure it out; our results suggest that some do so even before age 3. To investigate how children pick up on epistemic flavors, we explore distributional cues that distinguish roots and epistemics. The semantic literature argues they differ in “temporal orientation” (Condoravdi, 2002): while epistemics can have present or past orientation, root modals tend to be constrained to future orientation (Werner, 2006; Klecha, 2016; Rullmann & Matthewson, 2018). We show that in child-directed speech, this constraint is well-reflected in the distribution of aspectual features of roots and epistemics, but that the signal might be weak given the strong usage bias towards roots. We discuss (a) what these results imply for how children might acquire adult-like modal representations, and (b) possible learning paths towards adult-like modal representations.
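The temporal-orientation cue described above can be made concrete with a deliberately crude sketch: a perfect ("must have") under the modal signals past orientation, compatible with an epistemic reading, while a bare eventive complement signals future orientation, typical of root readings. This is an illustration of the distributional idea, not the paper's actual coding scheme:

```python
import re

def orientation_cue(utterance):
    """Classify a 'must' utterance by a single aspectual surface cue."""
    if re.search(r"\bmust have\b", utterance):
        return "past-oriented (epistemic-compatible)"
    return "future-oriented (root-compatible)"

print(orientation_cue("You must have seen it"))    # past-oriented (epistemic-compatible)
print(orientation_cue("You must clean your room"))  # future-oriented (root-compatible)
```

Counting such cues over child-directed speech is the kind of measurement that reveals how weakly the epistemic signal is represented in the input.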
"Figuring Out Root and Epistemic Uses of Modals: The Role of the Input." Pub Date: 2022-08-26 | DOI: 10.1093/jos/ffac010
Pub Date: 2022-06-03 | DOI: 10.1186/s13326-022-00267-3
Hoang-Quynh Le, Duy-Cat Can, Nigel Collier
"Exploiting document graphs for inter sentence relation extraction." Journal of Biomedical Semantics 13(1).
Pub Date : 2022-06-03 DOI: 10.1186/s13326-022-00270-8
Sanchez-Graillet, Olivia, Witte, Christian, Grimm, Frank, Grautoff, Steffen, Ell, Basil, Cimiano, Philipp
Evidence-based medicine holds that medical and clinical decisions should be made by taking high-quality evidence into account, most notably in the form of randomized clinical trials. Evidence-based decision-making requires aggregating the evidence available across multiple trials to reach, by means of systematic reviews, a conclusive recommendation on which treatment is best suited for a given patient population. However, producing systematic reviews that keep up with the ever-growing number of published clinical trials is challenging. New computational approaches are therefore necessary to support the creation of systematic reviews that include the most up-to-date evidence. We propose a method to synthesize the evidence available in clinical trials in an ad-hoc, on-demand manner by automatically arranging that evidence into a hierarchical argument which recommends one therapy as superior to another along a number of key dimensions corresponding to the clinical endpoints of interest. The method has also been implemented as a web tool that allows users to explore the effects of excluding individual points of evidence and to indicate relative preferences among the endpoints. In two use cases, our method generated conclusions similar to those of published systematic reviews. To evaluate the web tool, we carried out a survey and usability analysis with medical professionals. The results show that the tool was perceived as valuable, with participants acknowledging its potential to inform clinical decision-making and to complement the information in existing medical guidelines. The method presented is a simple yet effective argumentation-based approach that helps support the synthesis of clinical trial evidence. A current limitation is its reliance on a manually populated knowledge base; this could be alleviated by applying natural language processing methods to extract the relevant information from publications.
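The evidence-aggregation idea can be sketched as a toy scoring over per-endpoint evidence; this is a hedged illustration only, and the `Evidence` record, the `recommend` function, and all trial, endpoint, and therapy names are hypothetical, not the authors' implementation:

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    trial: str     # e.g. a trial identifier (hypothetical)
    endpoint: str  # e.g. "efficacy", "safety" (hypothetical)
    favors: str    # therapy this trial favors on this endpoint

def recommend(evidence, preferences, excluded=()):
    """Score each therapy: every piece of evidence votes for the therapy
    it favors, weighted by the user's preference for that endpoint.
    Trials listed in `excluded` are ignored, mimicking the interactive
    exclusion of evidence points."""
    scores = {}
    for e in evidence:
        if e.trial in excluded:
            continue
        weight = preferences.get(e.endpoint, 1.0)
        scores[e.favors] = scores.get(e.favors, 0.0) + weight
    return max(scores, key=scores.get)

evidence = [
    Evidence("trial-1", "efficacy", "A"),
    Evidence("trial-2", "efficacy", "A"),
    Evidence("trial-3", "safety", "B"),
]
print(recommend(evidence, {"efficacy": 1.0, "safety": 1.0}))  # 'A'
print(recommend(evidence, {"efficacy": 1.0, "safety": 3.0}))  # 'B'
```

Raising the preference weight of the "safety" endpoint flips the recommendation, which is the kind of sensitivity the web tool lets users explore interactively.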
{"title":"Synthesizing evidence from clinical trials with dynamic interactive argument trees","authors":"Sanchez-Graillet, Olivia, Witte, Christian, Grimm, Frank, Grautoff, Steffen, Ell, Basil, Cimiano, Philipp","doi":"10.1186/s13326-022-00270-8","DOIUrl":"https://doi.org/10.1186/s13326-022-00270-8","url":null,"abstract":"","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"22 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2022-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138538451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2022-05-23 DOI: 10.1186/s13326-022-00271-7
Olivia Sanchez-Graillet, Christian Witte, Frank Grimm, P. Cimiano
{"title":"An annotated corpus of clinical trial publications supporting schema-based relational information extraction","authors":"Olivia Sanchez-Graillet, Christian Witte, Frank Grimm, P. Cimiano","doi":"10.1186/s13326-022-00271-7","DOIUrl":"https://doi.org/10.1186/s13326-022-00271-7","url":null,"abstract":"","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":" ","pages":""},"PeriodicalIF":1.9,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43016385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2022-05-08 DOI: 10.1186/s13326-022-00269-1
Lucas Emanuel Silva E Oliveira, Ana Carolina Peters, Adalniza Moura Pucca da Silva, Caroline Pilatti Gebeluca, Yohan Bonescki Gumiel, Lilian Mie Mukai Cintho, Deborah Ribeiro Carvalho, Sadid Al Hasan, Claudia Maria Cabral Moro
Background: The high volume of research on extracting patient information from electronic health records (EHRs) has increased the demand for annotated corpora, which are a valuable resource for both the development and the evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside English, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field.
Methods: In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present: (1) a survey listing common aspects, differences, and lessons learned from previous research; (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives; (3) a web-based annotation tool featuring an annotation suggestion function; and (4) both intrinsic and extrinsic evaluations of the annotations.
Results: This study resulted in SemClinBr, a corpus of 1,000 clinical notes labeled with 65,117 entities and 11,263 relations. In addition, negation cue and medical abbreviation dictionaries were generated from the annotations. The average annotator agreement score varied from 0.71 (strict match) to 0.92 (relaxed match, accepting partial overlaps and hierarchically related semantic types). The extrinsic evaluation, applying the corpus to two downstream NLP tasks, demonstrated the reliability and usefulness of the annotations, with the systems achieving results consistent with the agreement scores.
Conclusion: The SemClinBr corpus and other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community, boosting the utilization of EHRs in both clinical practice and biomedical research. To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus.
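The strict versus relaxed agreement scores can be illustrated with a minimal span-matching sketch; this is hedged (SemClinBr's exact matching criteria may differ), and the spans, semantic types, and function names below are invented for illustration:

```python
# Annotations are (start, end, semantic_type) triples over the same text.
def strict_match(a, b):
    return a == b  # identical offsets and identical semantic type

def relaxed_match(a, b):
    (s1, e1, t1), (s2, e2, t2) = a, b
    # Same semantic type and overlapping (half-open) spans.
    return t1 == t2 and s1 < e2 and s2 < e1

def agreement(ann1, ann2, match):
    """Fraction of annotator 1's annotations that some annotation of
    annotator 2 matches under the given criterion (directional)."""
    hits = sum(any(match(a, b) for b in ann2) for a in ann1)
    return hits / len(ann1)

ann1 = [(0, 5, "Drug"), (10, 20, "Disease")]
ann2 = [(0, 5, "Drug"), (12, 18, "Disease")]
print(agreement(ann1, ann2, strict_match))   # 0.5
print(agreement(ann1, ann2, relaxed_match))  # 1.0
```

As in the reported scores, relaxing the criterion from exact offsets to partial overlap raises the measured agreement.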
{"title":"SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks.","authors":"Lucas Emanuel Silva E Oliveira, Ana Carolina Peters, Adalniza Moura Pucca da Silva, Caroline Pilatti Gebeluca, Yohan Bonescki Gumiel, Lilian Mie Mukai Cintho, Deborah Ribeiro Carvalho, Sadid Al Hasan, Claudia Maria Cabral Moro","doi":"10.1186/s13326-022-00269-1","DOIUrl":"https://doi.org/10.1186/s13326-022-00269-1","url":null,"abstract":"","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"13 1","pages":"13"},"PeriodicalIF":1.9,"publicationDate":"2022-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9080187/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10252310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}