Dynamic Retrieval Augmented Generation of Ontologies using Artificial Intelligence (DRAGON-AI).
Pub Date: 2024-10-17 | DOI: 10.1186/s13326-024-00320-3
Sabrina Toro, Anna V Anagnostopoulos, Susan M Bello, Kai Blumberg, Rhiannon Cameron, Leigh Carmody, Alexander D Diehl, Damion M Dooley, William D Duncan, Petra Fey, Pascale Gaudet, Nomi L Harris, Marcin P Joachimiak, Leila Kiani, Tiago Lubiana, Monica C Munoz-Torres, Shawn O'Neil, David Osumi-Sutherland, Aleix Puig-Barbe, Justin T Reese, Leonore Reiser, Sofia Mc Robb, Troy Ruemping, James Seager, Eric Sid, Ray Stefancsik, Magalie Weber, Valerie Wood, Melissa A Haendel, Christopher J Mungall
Background: Ontologies are fundamental components of informatics infrastructure in domains such as biomedical, environmental, and food sciences, representing consensus knowledge in an accurate and computable form. However, their construction and maintenance demand substantial resources and require close collaboration between domain experts, curators, and ontology experts. We present Dynamic Retrieval Augmented Generation of Ontologies using AI (DRAGON-AI), an ontology generation method employing Large Language Models (LLMs) and Retrieval Augmented Generation (RAG). DRAGON-AI can generate textual and logical ontology components, drawing from existing knowledge in multiple ontologies and from unstructured text sources.
Results: We assessed the performance of DRAGON-AI on de novo term construction across ten diverse ontologies, making use of extensive manual evaluation of the results. Our method achieves high precision for relationship generation, though slightly lower than that of logic-based reasoning. It can also generate definitions deemed acceptable by expert evaluators, but these scored worse than human-authored definitions. Notably, evaluators with the highest level of confidence in a domain were better able to discern flaws in AI-generated definitions. We also demonstrated the ability of DRAGON-AI to incorporate natural language instructions in the form of GitHub issues.
Conclusions: These findings suggest DRAGON-AI's potential to substantially aid the manual ontology construction process. However, our results also underscore the importance of having expert curators and ontology editors drive the ontology generation process.
{"title":"Dynamic Retrieval Augmented Generation of Ontologies using Artificial Intelligence (DRAGON-AI).","authors":"Sabrina Toro, Anna V Anagnostopoulos, Susan M Bello, Kai Blumberg, Rhiannon Cameron, Leigh Carmody, Alexander D Diehl, Damion M Dooley, William D Duncan, Petra Fey, Pascale Gaudet, Nomi L Harris, Marcin P Joachimiak, Leila Kiani, Tiago Lubiana, Monica C Munoz-Torres, Shawn O'Neil, David Osumi-Sutherland, Aleix Puig-Barbe, Justin T Reese, Leonore Reiser, Sofia Mc Robb, Troy Ruemping, James Seager, Eric Sid, Ray Stefancsik, Magalie Weber, Valerie Wood, Melissa A Haendel, Christopher J Mungall","doi":"10.1186/s13326-024-00320-3","DOIUrl":"https://doi.org/10.1186/s13326-024-00320-3","url":null,"abstract":"<p><strong>Background: </strong>Ontologies are fundamental components of informatics infrastructure in domains such as biomedical, environmental, and food sciences, representing consensus knowledge in an accurate and computable form. However, their construction and maintenance demand substantial resources and necessitate substantial collaboration between domain experts, curators, and ontology experts. We present Dynamic Retrieval Augmented Generation of Ontologies using AI (DRAGON-AI), an ontology generation method employing Large Language Models (LLMs) and Retrieval Augmented Generation (RAG). DRAGON-AI can generate textual and logical ontology components, drawing from existing knowledge in multiple ontologies and unstructured text sources.</p><p><strong>Results: </strong>We assessed performance of DRAGON-AI on de novo term construction across ten diverse ontologies, making use of extensive manual evaluation of results. Our method has high precision for relationship generation, but has slightly lower precision than from logic-based reasoning. Our method is also able to generate definitions deemed acceptable by expert evaluators, but these scored worse than human-authored definitions. Notably, evaluators with the highest level of confidence in a domain were better able to discern flaws in AI-generated definitions. We also demonstrated the ability of DRAGON-AI to incorporate natural language instructions in the form of GitHub issues.</p><p><strong>Conclusions: </strong>These findings suggest DRAGON-AI's potential to substantially aid the manual ontology construction process. However, our results also underscore the importance of having expert curators and ontology editors drive the ontology generation process.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"15 1","pages":"19"},"PeriodicalIF":1.6,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11484368/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142466149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MeSH2Matrix: combining MeSH keywords and machine learning for biomedical relation classification based on PubMed.
Pub Date: 2024-10-02 | DOI: 10.1186/s13326-024-00319-w
Houcemeddine Turki, Bonaventure F P Dossou, Chris Chinenye Emezue, Abraham Toluwase Owodunni, Mohamed Ali Hadj Taieb, Mohamed Ben Aouicha, Hanen Ben Hassen, Afif Masmoudi
Biomedical relation classification has been significantly improved by the application of advanced machine learning techniques to the raw texts of scholarly publications. Despite this improvement, the reliance on large chunks of raw text makes these algorithms suffer in terms of generalization, precision, and reliability. Using the distinctive characteristics of bibliographic metadata can prove effective in achieving better performance on this challenging task. In this research paper, we introduce an approach for biomedical relation classification using the qualifiers of co-occurring Medical Subject Headings (MeSH). First, we introduce MeSH2Matrix, our dataset of 46,469 biomedical relations curated from PubMed publications using this approach. The dataset includes a matrix that maps associations between the qualifiers of subject MeSH keywords and those of object MeSH keywords, and it specifies the corresponding Wikidata relation type and the superclass of semantic relations for each relation. Using MeSH2Matrix, we build and train three machine learning models (a Support Vector Machine [SVM], a dense model [D-Model], and a convolutional neural network [C-Net]) to evaluate the efficiency of our approach for biomedical relation classification. Our best model achieves an accuracy of 70.78% for 195 classes and 83.09% for five superclasses. Finally, we provide a confusion matrix and extensive feature analyses to better examine the relationship between the MeSH qualifiers and the biomedical relations being classified. We hope these results will shed light on developing better algorithms for biomedical ontology classification based on the MeSH keywords of PubMed publications. For reproducibility, MeSH2Matrix and all of our source code are publicly accessible at https://github.com/SisonkeBiotik-Africa/MeSH2Matrix .
{"title":"MeSH2Matrix: combining MeSH keywords and machine learning for biomedical relation classification based on PubMed.","authors":"Houcemeddine Turki, Bonaventure F P Dossou, Chris Chinenye Emezue, Abraham Toluwase Owodunni, Mohamed Ali Hadj Taieb, Mohamed Ben Aouicha, Hanen Ben Hassen, Afif Masmoudi","doi":"10.1186/s13326-024-00319-w","DOIUrl":"10.1186/s13326-024-00319-w","url":null,"abstract":"<p><p>Biomedical relation classification has been significantly improved by the application of advanced machine learning techniques on the raw texts of scholarly publications. Despite this improvement, the reliance on large chunks of raw text makes these algorithms suffer in terms of generalization, precision, and reliability. The use of the distinctive characteristics of bibliographic metadata can prove effective in achieving better performance for this challenging task. In this research paper, we introduce an approach for biomedical relation classification using the qualifiers of co-occurring Medical Subject Headings (MeSH). First of all, we introduce MeSH2Matrix, our dataset consisting of 46,469 biomedical relations curated from PubMed publications using our approach. Our dataset includes a matrix that maps associations between the qualifiers of subject MeSH keywords and those of object MeSH keywords. It also specifies the corresponding Wikidata relation type and the superclass of semantic relations for each relation. Using MeSH2Matrix, we build and train three machine learning models (Support Vector Machine [SVM], a dense model [D-Model], and a convolutional neural network [C-Net]) to evaluate the efficiency of our approach for biomedical relation classification. Our best model achieves an accuracy of 70.78% for 195 classes and 83.09% for five superclasses. Finally, we provide confusion matrix and extensive feature analyses to better examine the relationship between the MeSH qualifiers and the biomedical relations being classified. Our results will hopefully shed light on developing better algorithms for biomedical ontology classification based on the MeSH keywords of PubMed publications. For reproducibility purposes, MeSH2Matrix, as well as all our source codes, are made publicly accessible at https://github.com/SisonkeBiotik-Africa/MeSH2Matrix .</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"15 1","pages":"18"},"PeriodicalIF":1.6,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11445994/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142361554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Annotation of epilepsy clinic letters for natural language processing
Pub Date: 2024-09-15 | DOI: 10.1186/s13326-024-00316-z
Beata Fonferko-Shadrach, Huw Strafford, Carys Jones, Russell A. Khan, Sharon Brown, Jenny Edwards, Jonathan Hawken, Luke E. Shrimpton, Catharine P. White, Robert Powell, Inder M. S. Sawhney, William O. Pickrell, Arron S. Lacey
Natural language processing (NLP) is increasingly being used to extract structured information from unstructured text to assist clinical decision-making and aid healthcare research. The availability of expert-annotated documents for the development and validation of NLP applications is limited. We created synthetic clinical documents to address this and to validate the Extraction of Epilepsy Clinical Text version 2 (ExECTv2) NLP pipeline. We created 200 synthetic clinic letters based on hospital outpatient consultations with epilepsy specialists. The letters were double annotated by trained clinicians and researchers according to agreed guidelines. We used the annotation tool Markup with an epilepsy concept list based on the Unified Medical Language System ontology. All annotations were reviewed, and a gold standard set of annotations was agreed and used to validate the performance of ExECTv2. The overall inter-annotator agreement (IAA) between the two sets of annotations produced a per-item F1 score of 0.73. Validating ExECTv2 against the gold standard gave an overall F1 score of 0.87 per item and 0.90 per letter. The synthetic letters, annotations, and annotation guidelines have been made freely available. To our knowledge, this is the first publicly available set of annotated epilepsy clinic letters and guidelines that can be used by NLP researchers with minimal epilepsy knowledge. The IAA results show that clinical text annotation tasks are difficult and require a gold standard to be agreed by researcher consensus. ExECTv2, our automated epilepsy NLP pipeline, extracted detailed epilepsy information from unstructured epilepsy letters more accurately than the human annotators, further confirming the utility of NLP for clinical and research applications.
{"title":"Annotation of epilepsy clinic letters for natural language processing","authors":"Beata Fonferko-Shadrach, Huw Strafford, Carys Jones, Russell A. Khan, Sharon Brown, Jenny Edwards, Jonathan Hawken, Luke E. Shrimpton, Catharine P. White, Robert Powell, Inder M. S. Sawhney, William O. Pickrell, Arron S. Lacey","doi":"10.1186/s13326-024-00316-z","DOIUrl":"https://doi.org/10.1186/s13326-024-00316-z","url":null,"abstract":"Natural language processing (NLP) is increasingly being used to extract structured information from unstructured text to assist clinical decision-making and aid healthcare research. The availability of expert-annotated documents for the development and validation of NLP applications is limited. We created synthetic clinical documents to address this, and to validate the Extraction of Epilepsy Clinical Text version 2 (ExECTv2) NLP pipeline. We created 200 synthetic clinic letters based on hospital outpatient consultations with epilepsy specialists. The letters were double annotated by trained clinicians and researchers according to agreed guidelines. We used the annotation tool, Markup, with an epilepsy concept list based on the Unified Medical Language System ontology. All annotations were reviewed, and a gold standard set of annotations was agreed and used to validate the performance of ExECTv2. The overall inter-annotator agreement (IAA) between the two sets of annotations produced a per item F1 score of 0.73. Validating ExECTv2 using the gold standard gave an overall F1 score of 0.87 per item, and 0.90 per letter. The synthetic letters, annotations, and annotation guidelines have been made freely available. To our knowledge, this is the first publicly available set of annotated epilepsy clinic letters and guidelines that can be used for NLP researchers with minimum epilepsy knowledge. The IAA results show that clinical text annotation tasks are difficult and require a gold standard to be arranged by researcher consensus. The results for ExECTv2, our automated epilepsy NLP pipeline, extracted detailed epilepsy information from unstructured epilepsy letters with more accuracy than human annotators, further confirming the utility of NLP for clinical and research applications.","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"36 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142254725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An extensible and unifying approach to retrospective clinical data modeling: the BrainTeaser Ontology.
Pub Date: 2024-08-30 | DOI: 10.1186/s13326-024-00317-y
Guglielmo Faggioli, Laura Menotti, Stefano Marchesin, Adriano Chió, Arianna Dagliati, Mamede de Carvalho, Marta Gromicho, Umberto Manera, Eleonora Tavazzi, Giorgio Maria Di Nunzio, Gianmaria Silvello, Nicola Ferro
Automatic disease progression prediction models require large amounts of training data, which are seldom available, especially when it comes to rare diseases. A possible solution is to integrate data from different medical centres. Nevertheless, various centres often follow diverse data collection procedures and assign different semantics to collected data. Ontologies, used as schemas for interoperable knowledge bases, represent a state-of-the-art solution to harmonize the semantics and foster data integration from various sources. This work presents the BrainTeaser Ontology (BTO), an ontology that models the clinical data associated with two brain-related rare diseases (ALS and MS) in a comprehensive and modular manner. BTO assists in organizing and standardizing the data collected during patient follow-up. It was created by harmonizing schemas currently used by multiple medical centres into a common ontology, following a bottom-up approach. As a result, BTO effectively addresses the practical data collection needs of various real-world situations and promotes data portability and interoperability. BTO captures various clinical occurrences, such as disease onset, symptoms, diagnostic and therapeutic procedures, and relapses, using an event-based approach. Developed in collaboration with medical partners and domain experts, BTO offers a holistic view of ALS and MS for supporting the representation of retrospective and prospective data. Furthermore, BTO adheres to Open Science and FAIR (Findable, Accessible, Interoperable, and Reusable) principles, making it a reliable framework for developing predictive tools to aid in medical decision-making and patient care. Although BTO is designed for ALS and MS, its modular structure makes it easily extendable to other brain-related diseases, showcasing its potential for broader applicability. Database URL: https://zenodo.org/records/7886998 .
{"title":"An extensible and unifying approach to retrospective clinical data modeling: the BrainTeaser Ontology.","authors":"Guglielmo Faggioli, Laura Menotti, Stefano Marchesin, Adriano Chió, Arianna Dagliati, Mamede de Carvalho, Marta Gromicho, Umberto Manera, Eleonora Tavazzi, Giorgio Maria Di Nunzio, Gianmaria Silvello, Nicola Ferro","doi":"10.1186/s13326-024-00317-y","DOIUrl":"10.1186/s13326-024-00317-y","url":null,"abstract":"<p><p>Automatic disease progression prediction models require large amounts of training data, which are seldom available, especially when it comes to rare diseases. A possible solution is to integrate data from different medical centres. Nevertheless, various centres often follow diverse data collection procedures and assign different semantics to collected data. Ontologies, used as schemas for interoperable knowledge bases, represent a state-of-the-art solution to homologate the semantics and foster data integration from various sources. This work presents the BrainTeaser Ontology (BTO), an ontology that models the clinical data associated with two brain-related rare diseases (ALS and MS) in a comprehensive and modular manner. BTO assists in organizing and standardizing the data collected during patient follow-up. It was created by harmonizing schemas currently used by multiple medical centers into a common ontology, following a bottom-up approach. As a result, BTO effectively addresses the practical data collection needs of various real-world situations and promotes data portability and interoperability. BTO captures various clinical occurrences, such as disease onset, symptoms, diagnostic and therapeutic procedures, and relapses, using an event-based approach. Developed in collaboration with medical partners and domain experts, BTO offers a holistic view of ALS and MS for supporting the representation of retrospective and prospective data. Furthermore, BTO adheres to Open Science and FAIR (Findable, Accessible, Interoperable, and Reusable) principles, making it a reliable framework for developing predictive tools to aid in medical decision-making and patient care. Although BTO is designed for ALS and MS, its modular structure makes it easily extendable to other brain-related diseases, showcasing its potential for broader applicability.Database URL https://zenodo.org/records/7886998 .</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"15 1","pages":"16"},"PeriodicalIF":1.6,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11363415/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142107743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Concretizing plan specifications as realizables within the OBO foundry.
Pub Date: 2024-08-20 | DOI: 10.1186/s13326-024-00315-0
William D Duncan, Matthew Diller, Damion Dooley, William R Hogan, John Beverley
Background: Within the Open Biological and Biomedical Ontology (OBO) Foundry, many ontologies represent the execution of a plan specification as a process in which a realizable entity that concretizes the plan specification, a "realizable concretization" (RC), is realized. This representation, which we call the "RC-account", provides a straightforward way to relate a plan specification to the entity that bears the realizable concretization and the process that realizes the realizable concretization. However, the adequacy of the RC-account has not been evaluated in the scientific literature. In this manuscript, we provide this evaluation and, thereby, give ontology developers sound reasons to use or not use the RC-account pattern.
Results: Analysis of the RC-account reveals that it is not adequate for representing failed plans. If the realizable concretization is flawed in some way, it is unclear what (if any) relation holds between the realizable entity and the plan specification. If the execution (i.e., realization) of the realizable concretization fails to carry out the actions given in the plan specification, it is unclear under the RC-account how to directly relate the failed execution to the entity carrying out the instructions given in the plan specification. These issues are exacerbated in the presence of changing plans.
Conclusions: We propose two solutions for representing failed plans. The first uses the Common Core Ontologies 'prescribed by' relation to connect a plan specification to the entity or process that utilizes the plan specification as a guide. The second, more complex, solution incorporates the process of creating a plan (in the sense of an intention to execute a plan specification) into the representation of executing plan specifications. We hypothesize that the first solution (i.e., use of 'prescribed by') is adequate for most situations. However, more research is needed to test this hypothesis as well as explore the other solutions presented in this manuscript.
{"title":"Concretizing plan specifications as realizables within the OBO foundry.","authors":"William D Duncan, Matthew Diller, Damion Dooley, William R Hogan, John Beverley","doi":"10.1186/s13326-024-00315-0","DOIUrl":"10.1186/s13326-024-00315-0","url":null,"abstract":"<p><strong>Background: </strong>Within the Open Biological and Biomedical Ontology (OBO) Foundry, many ontologies represent the execution of a plan specification as a process in which a realizable entity that concretizes the plan specification, a \"realizable concretization\" (RC), is realized. This representation, which we call the \"RC-account\", provides a straightforward way to relate a plan specification to the entity that bears the realizable concretization and the process that realizes the realizable concretization. However, the adequacy of the RC-account has not been evaluated in the scientific literature. In this manuscript, we provide this evaluation and, thereby, give ontology developers sound reasons to use or not use the RC-account pattern.</p><p><strong>Results: </strong>Analysis of the RC-account reveals that it is not adequate for representing failed plans. If the realizable concretization is flawed in some way, it is unclear what (if any) relation holds between the realizable entity and the plan specification. If the execution (i.e., realization) of the realizable concretization fails to carry out the actions given in the plan specification, it is unclear under the RC-account how to directly relate the failed execution to the entity carrying out the instructions given in the plan specification. These issues are exacerbated in the presence of changing plans.</p><p><strong>Conclusions: </strong>We propose two solutions for representing failed plans. The first uses the Common Core Ontologies 'prescribed by' relation to connect a plan specification to the entity or process that utilizes the plan specification as a guide. The second, more complex, solution incorporates the process of creating a plan (in the sense of an intention to execute a plan specification) into the representation of executing plan specifications. We hypothesize that the first solution (i.e., use of 'prescribed by') is adequate for most situations. However, more research is needed to test this hypothesis as well as explore the other solutions presented in this manuscript.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"15 1","pages":"15"},"PeriodicalIF":1.6,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11334599/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142004295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mapping vaccine names in clinical trials to vaccine ontology using cascaded fine-tuned domain-specific language models.
Pub Date: 2024-08-10 | DOI: 10.1186/s13326-024-00318-x
Jianfu Li, Yiming Li, Yuanyi Pan, Jinjing Guo, Zenan Sun, Fang Li, Yongqun He, Cui Tao
Background: Vaccines have revolutionized public health by providing protection against infectious diseases. They stimulate the immune system and generate memory cells to defend against targeted diseases. Clinical trials evaluate vaccine performance, including dosage, administration routes, and potential side effects. ClinicalTrials.gov is a valuable repository of clinical trial information, but its vaccine data lack standardization, leading to challenges in automatic concept mapping, vaccine-related knowledge development, evidence-based decision-making, and vaccine surveillance.
Results: In this study, we developed a cascaded framework that capitalized on multiple domain knowledge sources, including clinical trials, the Unified Medical Language System (UMLS), and the Vaccine Ontology (VO), to enhance the performance of domain-specific language models for automated mapping of vaccine names in clinical trials to the VO. The VO is a community-based ontology developed to promote vaccine data standardization, integration, and computer-assisted reasoning. Our methodology involved extracting and annotating data from various sources. We then performed pre-training on the PubMedBERT model, leading to the development of CTPubMedBERT. Subsequently, we enhanced CTPubMedBERT by incorporating SAPBERT, which was pretrained using the UMLS, resulting in CTPubMedBERT + SAPBERT. Further refinement was accomplished through fine-tuning on the Vaccine Ontology corpus and vaccine data from clinical trials, yielding the CTPubMedBERT + SAPBERT + VO model. Finally, we utilized a collection of pre-trained models, along with a weighted rule-based ensemble approach, to normalize the vaccine corpus and improve the accuracy of the process. The ranking step in concept normalization prioritizes and orders candidate concepts to identify the most suitable match for a given mention. Ranking the top 10 concepts, our experiments demonstrate that the proposed cascaded framework consistently outperformed existing baselines on vaccine mapping, achieving 71.8% top-1 accuracy and 90.0% top-10 accuracy.
Conclusion: This study provides a detailed insight into a cascaded framework of fine-tuned domain-specific language models for improving the mapping of vaccine names in clinical trials to the VO. By effectively leveraging domain-specific information and applying weighted rule-based ensembles of different pre-trained BERT models, our framework can significantly enhance this mapping.
{"title":"Mapping vaccine names in clinical trials to vaccine ontology using cascaded fine-tuned domain-specific language models.","authors":"Jianfu Li, Yiming Li, Yuanyi Pan, Jinjing Guo, Zenan Sun, Fang Li, Yongqun He, Cui Tao","doi":"10.1186/s13326-024-00318-x","DOIUrl":"10.1186/s13326-024-00318-x","url":null,"abstract":"<p><strong>Background: </strong>Vaccines have revolutionized public health by providing protection against infectious diseases. They stimulate the immune system and generate memory cells to defend against targeted diseases. Clinical trials evaluate vaccine performance, including dosage, administration routes, and potential side effects.</p><p><strong>Clinicaltrials: </strong>gov is a valuable repository of clinical trial information, but the vaccine data in them lacks standardization, leading to challenges in automatic concept mapping, vaccine-related knowledge development, evidence-based decision-making, and vaccine surveillance.</p><p><strong>Results: </strong>In this study, we developed a cascaded framework that capitalized on multiple domain knowledge sources, including clinical trials, the Unified Medical Language System (UMLS), and the Vaccine Ontology (VO), to enhance the performance of domain-specific language models for automated mapping of VO from clinical trials. The Vaccine Ontology (VO) is a community-based ontology that was developed to promote vaccine data standardization, integration, and computer-assisted reasoning. Our methodology involved extracting and annotating data from various sources. We then performed pre-training on the PubMedBERT model, leading to the development of CTPubMedBERT. Subsequently, we enhanced CTPubMedBERT by incorporating SAPBERT, which was pretrained using the UMLS, resulting in CTPubMedBERT + SAPBERT. Further refinement was accomplished through fine-tuning using the Vaccine Ontology corpus and vaccine data from clinical trials, yielding the CTPubMedBERT + SAPBERT + VO model. Finally, we utilized a collection of pre-trained models, along with the weighted rule-based ensemble approach, to normalize the vaccine corpus and improve the accuracy of the process. The ranking process in concept normalization involves prioritizing and ordering potential concepts to identify the most suitable match for a given context. We conducted a ranking of the Top 10 concepts, and our experimental results demonstrate that our proposed cascaded framework consistently outperformed existing effective baselines on vaccine mapping, achieving 71.8% on top 1 candidate's accuracy and 90.0% on top 10 candidate's accuracy.</p><p><strong>Conclusion: </strong>This study provides a detailed insight into a cascaded framework of fine-tuned domain-specific language models improving mapping of VO from clinical trials. 
By effectively leveraging domain-specific information and applying weighted rule-based ensembles of different pre-trained BERT models, our framework can significantly enhance the mapping of VO from clinical trials.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"15 1","pages":"14"},"PeriodicalIF":1.6,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11316402/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141912790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
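The core normalization step — embedding a vaccine mention and ranking ontology candidate labels by similarity, then scoring top-k accuracy — can be sketched as below. An off-the-shelf SapBERT checkpoint is used here as a stand-in for the paper's fine-tuned CTPubMedBERT + SAPBERT + VO model, and the mention/candidate/gold lists are invented examples.

```python
# Sketch: rank candidate ontology labels for a vaccine mention by
# cosine similarity of [CLS] embeddings, then compute top-k accuracy.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"  # stand-in checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
enc = AutoModel.from_pretrained(MODEL)

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = enc(**batch)
    cls = out.last_hidden_state[:, 0]           # [CLS] token embedding
    return torch.nn.functional.normalize(cls, dim=-1)

candidates = ["influenza vaccine", "COVID-19 mRNA vaccine", "BCG vaccine"]  # toy labels
mentions = ["quadrivalent flu shot", "BNT162b2"]
gold = ["influenza vaccine", "COVID-19 mRNA vaccine"]

cand_emb, ment_emb = embed(candidates), embed(mentions)
scores = ment_emb @ cand_emb.T                   # cosine similarity matrix

def top_k_accuracy(scores, gold, k):
    hits = 0
    for row, answer in zip(scores, gold):
        ranked = [candidates[int(i)] for i in row.topk(min(k, len(candidates))).indices]
        hits += answer in ranked
    return hits / len(gold)

print("top-1:", top_k_accuracy(scores, gold, 1), "top-3:", top_k_accuracy(scores, gold, 3))
```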
Chemical entity normalization for successful translational development of Alzheimer's disease and dementia therapeutics.
Pub Date: 2024-07-31 | DOI: 10.1186/s13326-024-00314-1
Sarah Mullin, Robert McDougal, Kei-Hoi Cheung, Halil Kilicoglu, Amanda Beck, Caroline J Zeiss
Background: Identifying chemical mentions within the Alzheimer's and dementia literature can provide a powerful tool to further therapeutic research. Leveraging the Chemical Entities of Biological Interest (ChEBI) ontology, which is rich in hierarchical and other relationship types, for entity normalization can provide an advantage for future downstream applications. We provide a reproducible hybrid approach that combines an ontology-enhanced PubMedBERT model for disambiguation with a dictionary-based method for candidate selection.
Results: There were 56,553 chemical mentions in the titles of 44,812 unique PubMed article abstracts. Based on our gold standard, our method of disambiguation improved entity normalization by 25.3 percentage points compared to using only the dictionary-based approach with fuzzy-string matching for disambiguation. For the CRAFT corpus, our method outperformed baselines (maximum 78.4%) with 91.17% accuracy. For our Alzheimer's and dementia cohort, we were able to add 47.1% more potential mappings between MeSH and ChEBI compared to BioPortal.
Conclusion: Use of natural language models like PubMedBERT and resources such as ChEBI and PubChem provide a beneficial way to link entity mentions to ontology terms, while further supporting downstream tasks like filtering ChEBI mentions based on roles and assertions to find beneficial therapies for Alzheimer's and dementia.
{"title":"Chemical entity normalization for successful translational development of Alzheimer's disease and dementia therapeutics.","authors":"Sarah Mullin, Robert McDougal, Kei-Hoi Cheung, Halil Kilicoglu, Amanda Beck, Caroline J Zeiss","doi":"10.1186/s13326-024-00314-1","DOIUrl":"10.1186/s13326-024-00314-1","url":null,"abstract":"<p><strong>Background: </strong>Identifying chemical mentions within the Alzheimer's and dementia literature can provide a powerful tool to further therapeutic research. Leveraging the Chemical Entities of Biological Interest (ChEBI) ontology, which is rich in hierarchical and other relationship types, for entity normalization can provide an advantage for future downstream applications. We provide a reproducible hybrid approach that combines an ontology-enhanced PubMedBERT model for disambiguation with a dictionary-based method for candidate selection.</p><p><strong>Results: </strong>There were 56,553 chemical mentions in the titles of 44,812 unique PubMed article abstracts. Based on our gold standard, our method of disambiguation improved entity normalization by 25.3 percentage points compared to using only the dictionary-based approach with fuzzy-string matching for disambiguation. For the CRAFT corpus, our method outperformed baselines (maximum 78.4%) with a 91.17% accuracy. For our Alzheimer's and dementia cohort, we were able to add 47.1% more potential mappings between MeSH and ChEBI when compared to BioPortal.</p><p><strong>Conclusion: </strong>Use of natural language models like PubMedBERT and resources such as ChEBI and PubChem provide a beneficial way to link entity mentions to ontology terms, while further supporting downstream tasks like filtering ChEBI mentions based on roles and assertions to find beneficial therapies for Alzheimer's and dementia.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"15 1","pages":"13"},"PeriodicalIF":1.6,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11290083/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141855609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Empowering standardization of cancer vaccines through ontology: enhanced modeling and data analysis.
Pub Date: 2024-06-19 | DOI: 10.1186/s13326-024-00312-3
Jie Zheng, Xingxian Li, Anna Maria Masci, Hayleigh Kahn, Anthony Huffman, Eliyas Asfaw, Yuanyi Pan, Jinjing Guo, Virginia He, Justin Song, Andrey I Seleznev, Asiyah Yu Lin, Yongqun He
Background: The exploration of cancer vaccines has yielded a multitude of studies, resulting in a diverse collection of information. The heterogeneity of cancer vaccine data significantly impedes effective integration and analysis. While CanVaxKB serves as a pioneering database for over 670 manually annotated cancer vaccines, it is important to distinguish that a database, on its own, does not offer the structured relationships and standardized definitions found in an ontology. Recognizing this, we expanded the Vaccine Ontology (VO) to include those cancer vaccines present in CanVaxKB that were not initially covered, enhancing VO's capacity to systematically define and interrelate cancer vaccines.
Results: An ontology design pattern (ODP) was first developed and applied to semantically represent various cancer vaccines, capturing their associated entities and relations. By applying the ODP, we generated a cancer vaccine template in a tabular format and converted it into the RDF/OWL format for generation of cancer vaccine terms in the VO. The '12MP vaccine' was used as an example cancer vaccine to demonstrate the application of the ODP. VO also reuses reference ontology terms to represent entities such as cancer diseases and vaccine hosts. Description Logic (DL) and SPARQL query scripts were developed and used to query for cancer vaccines based on different vaccine features and to demonstrate the versatility of the VO representation. Additionally, ontological modeling was applied to illustrate cancer vaccine related concepts and studies for in-depth cancer vaccine analysis. A cancer vaccine-specific VO view, referred to as "CVO," was generated; it contains 928 classes, including 704 cancer vaccines. The CVO OWL file is publicly available at http://purl.obolibrary.org/obo/vo/cvo.owl for sharing and applications.
Conclusion: To facilitate the standardization, integration, and analysis of cancer vaccine data, we expanded the Vaccine Ontology (VO) to systematically model and represent cancer vaccines. We also developed a pipeline to automate the inclusion of cancer vaccines and associated terms in the VO. This not only enriches the data's standardization and integration, but also leverages ontological modeling to deepen the analysis of cancer vaccine information, maximizing benefits for researchers and clinicians.
Availability: The VO-cancer GitHub website is: https://github.com/vaccineontology/VO/tree/master/CVO .
{"title":"Empowering standardization of cancer vaccines through ontology: enhanced modeling and data analysis.","authors":"Jie Zheng, Xingxian Li, Anna Maria Masci, Hayleigh Kahn, Anthony Huffman, Eliyas Asfaw, Yuanyi Pan, Jinjing Guo, Virginia He, Justin Song, Andrey I Seleznev, Asiyah Yu Lin, Yongqun He","doi":"10.1186/s13326-024-00312-3","DOIUrl":"10.1186/s13326-024-00312-3","url":null,"abstract":"<p><strong>Background: </strong>The exploration of cancer vaccines has yielded a multitude of studies, resulting in a diverse collection of information. The heterogeneity of cancer vaccine data significantly impedes effective integration and analysis. While CanVaxKB serves as a pioneering database for over 670 manually annotated cancer vaccines, it is important to distinguish that a database, on its own, does not offer the structured relationships and standardized definitions found in an ontology. Recognizing this, we expanded the Vaccine Ontology (VO) to include those cancer vaccines present in CanVaxKB that were not initially covered, enhancing VO's capacity to systematically define and interrelate cancer vaccines.</p><p><strong>Results: </strong>An ontology design pattern (ODP) was first developed and applied to semantically represent various cancer vaccines, capturing their associated entities and relations. By applying the ODP, we generated a cancer vaccine template in a tabular format and converted it into the RDF/OWL format for generation of cancer vaccine terms in the VO. '12MP vaccine' was used as an example of cancer vaccines to demonstrate the application of the ODP. VO also reuses reference ontology terms to represent entities such as cancer diseases and vaccine hosts. Description Logic (DL) and SPARQL query scripts were developed and used to query for cancer vaccines based on different vaccine's features and to demonstrate the versatility of the VO representation. Additionally, ontological modeling was applied to illustrate cancer vaccine related concepts and studies for in-depth cancer vaccine analysis. A cancer vaccine-specific VO view, referred to as \"CVO,\" was generated, and it contains 928 classes including 704 cancer vaccines. The CVO OWL file is publicly available on: http://purl.obolibrary.org/obo/vo/cvo.owl , for sharing and applications.</p><p><strong>Conclusion: </strong>To facilitate the standardization, integration, and analysis of cancer vaccine data, we expanded the Vaccine Ontology (VO) to systematically model and represent cancer vaccines. We also developed a pipeline to automate the inclusion of cancer vaccines and associated terms in the VO. 
This not only enriches the data's standardization and integration, but also leverages ontological modeling to deepen the analysis of cancer vaccine information, maximizing benefits for researchers and clinicians.</p><p><strong>Availability: </strong>The VO-cancer GitHub website is: https://github.com/vaccineontology/VO/tree/master/CVO .</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"15 1","pages":"12"},"PeriodicalIF":1.6,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11186274/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141419260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
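Because the CVO view is published as an OWL file at a stable PURL, a quick SPARQL query over it (sketched below with rdflib) can confirm the reported class counts. The query simply counts owl:Class terms and lists a few labels; it assumes the file is reachable at the URL given above and served as RDF/XML.

```python
# Sketch: count classes in the published CVO OWL file and show a few labels.
from rdflib import Graph
from rdflib.namespace import OWL, RDF, RDFS

g = Graph()
# PURL from the article; assumes an RDF/XML serialization is served.
g.parse("http://purl.obolibrary.org/obo/vo/cvo.owl", format="xml")

n_classes = len(set(g.subjects(RDF.type, OWL.Class)))
print(f"owl:Class terms in CVO: {n_classes}")

query = """
SELECT ?cls ?label WHERE {
    ?cls a owl:Class ;
         rdfs:label ?label .
    FILTER(CONTAINS(LCASE(STR(?label)), "vaccine"))
} LIMIT 5
"""
for cls, label in g.query(query, initNs={"owl": OWL, "rdfs": RDFS}):
    print(cls, "->", label)
```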
Multi-task transfer learning for the prediction of entity modifiers in clinical text: application to opioid use disorder case detection.
Pub Date: 2024-06-07 | DOI: 10.1186/s13326-024-00311-4
Abdullateef I Almudaifer, Whitney Covington, JaMor Hairston, Zachary Deitch, Ankit Anand, Caleb M Carroll, Estera Crisan, William Bradford, Lauren A Walter, Ellen F Eaton, Sue S Feldman, John D Osborne
Background: The semantics of entities extracted from a clinical text can be dramatically altered by modifiers, including entity negation, uncertainty, conditionality, severity, and subject. Existing models for determining modifiers of clinical entities involve regular expressions or feature weights that are trained independently for each modifier.
Methods: We develop and evaluate a multi-task transformer architecture design where modifiers are learned and predicted jointly using the publicly available SemEval 2015 Task 14 corpus and a new Opioid Use Disorder (OUD) data set that contains modifiers shared with SemEval as well as novel modifiers specific for OUD. We evaluate the effectiveness of our multi-task learning approach versus previously published systems and assess the feasibility of transfer learning for clinical entity modifiers when only a portion of clinical modifiers are shared.
Results: Our approach achieved state-of-the-art results on the ShARe corpus from SemEval 2015 Task 14, showing an increase of 1.1% on weighted accuracy, 1.7% on unweighted accuracy, and 10% on micro F1 scores.
Conclusions: We show that learned weights from our shared model can be effectively transferred to a new partially matched data set, validating the use of transfer learning for clinical text modifiers.
{"title":"Multi-task transfer learning for the prediction of entity modifiers in clinical text: application to opioid use disorder case detection.","authors":"Abdullateef I Almudaifer, Whitney Covington, JaMor Hairston, Zachary Deitch, Ankit Anand, Caleb M Carroll, Estera Crisan, William Bradford, Lauren A Walter, Ellen F Eaton, Sue S Feldman, John D Osborne","doi":"10.1186/s13326-024-00311-4","DOIUrl":"10.1186/s13326-024-00311-4","url":null,"abstract":"<p><strong>Background: </strong>The semantics of entities extracted from a clinical text can be dramatically altered by modifiers, including entity negation, uncertainty, conditionality, severity, and subject. Existing models for determining modifiers of clinical entities involve regular expression or features weights that are trained independently for each modifier.</p><p><strong>Methods: </strong>We develop and evaluate a multi-task transformer architecture design where modifiers are learned and predicted jointly using the publicly available SemEval 2015 Task 14 corpus and a new Opioid Use Disorder (OUD) data set that contains modifiers shared with SemEval as well as novel modifiers specific for OUD. We evaluate the effectiveness of our multi-task learning approach versus previously published systems and assess the feasibility of transfer learning for clinical entity modifiers when only a portion of clinical modifiers are shared.</p><p><strong>Results: </strong>Our approach achieved state-of-the-art results on the ShARe corpus from SemEval 2015 Task 14, showing an increase of 1.1% on weighted accuracy, 1.7% on unweighted accuracy, and 10% on micro F1 scores.</p><p><strong>Conclusions: </strong>We show that learned weights from our shared model can be effectively transferred to a new partially matched data set, validating the use of transfer learning for clinical text modifiers.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"15 1","pages":"11"},"PeriodicalIF":1.9,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11157899/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141288154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Correction to: Semantic units: organizing knowledge graphs into semantically meaningful units of representation.
Pub Date: 2024-06-06 | DOI: 10.1186/s13326-024-00313-2
Lars Vogt, Tobias Kuhn, Robert Hoehndorf
{"title":"Correction to: Semantic units: organizing knowledge graphs into semantically meaningful units of representation.","authors":"Lars Vogt, Tobias Kuhn, Robert Hoehndorf","doi":"10.1186/s13326-024-00313-2","DOIUrl":"10.1186/s13326-024-00313-2","url":null,"abstract":"","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"15 1","pages":"10"},"PeriodicalIF":1.9,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11154969/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141283785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}