Pub Date: 2025-03-22 | DOI: 10.1186/s13326-025-00324-7
Sebastian Duesing, Jason Bennett, James A Overton, Randi Vita, Bjoern Peters
Background: While unstructured data, such as free text, constitutes a large amount of publicly available biomedical data, it is underutilized in automated analyses due to the difficulty of extracting meaning from it. Normalizing free-text data, i.e., removing inessential variance, enables the use of structured vocabularies like ontologies to represent the data and allow for harmonized queries over it. This paper presents an adaptable tool for free-text normalization and an evaluation of the application of this tool to two different fields curated from the literature in the Immune Epitope Database (IEDB): "age" and "data-location" (the part of a paper in which data was found).
Results: Free text entries for the database fields for subject age (4095 distinct values) and publication data-location (251,810 distinct values) in the IEDB were analyzed. Normalization was performed in three steps, namely character normalization, word normalization, and phrase normalization, using generalizable rules developed and applied with the tool presented in this manuscript. For the age dataset, in the character stage, the application of 21 rules resulted in 99.97% output validity; in the word stage, the application of 94 rules resulted in 98.06% output validity; and in the phrase stage, the application of 16 rules resulted in 83.81% output validity. For the data-location dataset, in the character stage, the application of 39 rules resulted in 99.99% output validity; in the word stage, the application of 187 rules resulted in 98.46% output validity; and in the phrase stage, the application of 12 rules resulted in 97.95% output validity.
Conclusions: We developed a generalizable approach for the normalization of free text as found in database fields with content on a specific topic. Creating and testing the rules was a one-time effort for a given field; the rules can now be applied to data as it is being curated. The standardization achieved in the two datasets tested substantially reduces variance in the content, which enhances the findability and usability of the data, chiefly by improving search functionality and enabling linkages with formal ontologies.
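The three-stage pipeline described in the Results can be pictured as ordered regular-expression substitutions applied stage by stage. The rules below are illustrative stand-ins, not the paper's actual rule sets (21 character, 94 word, and 16 phrase rules for the age field):

```python
import re

# Hypothetical rules for illustration only; the real rule sets are far larger.
CHAR_RULES = [
    (r"\s+", " "),             # collapse runs of whitespace
    (r"[\u2013\u2014]", "-"),  # normalize en/em dashes to a hyphen
]
WORD_RULES = [
    (r"\bwks?\b", "weeks"),
    (r"\byrs?\b", "years"),
    (r"\bmos?\b", "months"),
]
PHRASE_RULES = [
    # "between X and Y years" -> "X-Y years"
    (r"between (\d+) and (\d+) (weeks|months|years)", r"\1-\2 \3"),
]

def normalize(text: str) -> str:
    """Apply character, word, and phrase rules in sequence."""
    text = text.strip().lower()
    for stage in (CHAR_RULES, WORD_RULES, PHRASE_RULES):
        for pattern, replacement in stage:
            text = re.sub(pattern, replacement, text)
    return text

print(normalize("Between 6 and 8   wks"))  # -> "6-8 weeks"
```

Later stages see the output of earlier ones, so phrase rules can be written against already-normalized words, which is what keeps the rule sets small.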
{"title":"Standardizing free-text data exemplified by two fields from the Immune Epitope Database.","authors":"Sebastian Duesing, Jason Bennett, James A Overton, Randi Vita, Bjoern Peters","doi":"10.1186/s13326-025-00324-7","DOIUrl":"10.1186/s13326-025-00324-7","url":null,"abstract":"<p><strong>Background: </strong>While unstructured data, such as free text, constitutes a large amount of publicly available biomedical data, it is underutilized in automated analyses due to the difficulty of extracting meaning from it. Normalizing free-text data, i.e., removing inessential variance, enables the use of structured vocabularies like ontologies to represent the data and allow for harmonized queries over it. This paper presents an adaptable tool for free-text normalization and an evaluation of the application of this tool to two different fields curated from the literature in the Immune Epitope Database (IEDB): \"age\" and \"data-location\" (the part of a paper in which data was found).</p><p><strong>Results: </strong>Free text entries for the database fields for subject age (4095 distinct values) and publication data-location (251,810 distinct values) in the IEDB were analyzed. Normalization was performed in three steps, namely character normalization, word normalization, and phrase normalization, using generalizable rules developed and applied with the tool presented in this manuscript. For the age dataset, in the character stage, the application of 21 rules resulted in 99.97% output validity; in the word stage, the application of 94 rules resulted in 98.06% output validity; and in the phrase stage, the application of 16 rules resulted in 83.81% output validity. 
For the data-location dataset, in the character stage, the application of 39 rules resulted in 99.99% output validity; in the word stage, the application of 187 rules resulted in 98.46% output validity; and in the phrase stage, the application of 12 rules resulted in 97.95% output validity.</p><p><strong>Conclusions: </strong>We developed a generalizable approach for normalization of free text as found in database fields with content on a specific topic. Creating and testing the rules took a one-time effort for a given field that can now be applied to data as it is being curated. The standardization achieved in two datasets tested produces significantly reduced variance in the content which enhances the findability and usability of that data, chiefly by improving search functionality and enabling linkages with formal ontologies.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"16 1","pages":"5"},"PeriodicalIF":2.0,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11929277/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143692223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The amount of biomedical data is growing, and managing it is increasingly challenging. While the Findable, Accessible, Interoperable and Reusable (FAIR) data principles provide guidance, their adoption has proven difficult, especially in larger enterprises like pharmaceutical companies. In this manuscript, we describe how we leverage an Ontology-Based Data Management (OBDM) strategy for digital transformation in Novo Nordisk Research & Early Development. We include both our technical blueprint and our approach to organizational change management. We further discuss how such an OBDM ecosystem plays a pivotal role in the organization's digital aspirations for data federation and discovery fuelled by artificial intelligence. Our aim for this paper is to share the lessons learned, in order to foster dialogue with parties navigating similar waters while collectively advancing efforts in the fields of data management, semantics, and data-driven drug discovery.
{"title":"Digital evolution: Novo Nordisk's shift to ontology-based data management.","authors":"Shawn Zheng Kai Tan, Shounak Baksi, Thomas Gade Bjerregaard, Preethi Elangovan, Thrishna Kuttikattu Gopalakrishnan, Darko Hric, Joffrey Joumaa, Beidi Li, Kashif Rabbani, Santhosh Kannan Venkatesan, Joshua Daniel Valdez, Saritha Vettikunnel Kuriakose","doi":"10.1186/s13326-025-00327-4","DOIUrl":"10.1186/s13326-025-00327-4","url":null,"abstract":"<p><p>The amount of biomedical data is growing, and managing it is increasingly challenging. While Findable, Accessible, Interoperable and Reusable (FAIR) data principles provide guidance, their adoption has proven difficult, especially in larger enterprises like pharmaceutical companies. In this manuscript, we describe how we leverage an Ontology-Based Data Management (OBDM) strategy for digital transformation in Novo Nordisk Research & Early Development. Here, we include both our technical blueprint and our approach for organizational change management. We further discuss how such an OBDM ecosystem plays a pivotal role in the organization's digital aspirations for data federation and discovery fuelled by artificial intelligence. 
Our aim for this paper is to share the lessons learned in order to foster dialogue with parties navigating similar waters while collectively advancing the efforts in the fields of data management, semantics and data driven drug discovery.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"16 1","pages":"6"},"PeriodicalIF":2.0,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11929979/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143692220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-21 | DOI: 10.1186/s13326-025-00323-8
Gene Godbold, Jody Proescher, Pascale Gaudet
Background: The United States government has issued a new framework for screening synthetic nucleic acids. Beginning in October 2026, it calls for the screening of sequences 50 nucleotides or greater in length that are known to contribute to pathogenicity or toxicity for humans, regardless of the taxon from which they originate. Distinguishing sequences that encode pathogenic and toxic functions from those that lack them is not simple.
Objectives: Our project scope was to discern, describe, and catalog sequences involved in microbial pathogenesis from the scientific literature. We recognize a need for better terminology to designate pathogenic functions that are relevant across the entire range of existing parasites.
Methods: We canvassed publications investigating microbial pathogens of humans, other animals, and some plants to collect thousands of sequences that enable the exploitation of hosts. We compared sequences to each other, grouping them according to what host biological processes they subvert and the consequence(s) for the host. We developed terms to capture many of the varied pathogenic functions for sequences employed by parasitic microbes for host exploitation and applied these terms in a systematic manner to our dataset of sequences.
Results/conclusions: When appropriately applied to relevant sequences, the enhanced and expanded terms enable a quick and pertinent evaluation of a sequence's ability to endow a microbe with pathogenic function. This will allow providers of synthetic nucleic acids to rapidly assess sequences ordered by their customers for pathogenic capacity, helping fulfill the new US government guidance.
{"title":"New and revised gene ontology biological process terms describe multiorganism interactions critical for understanding microbial pathogenesis and sequences of concern.","authors":"Gene Godbold, Jody Proescher, Pascale Gaudet","doi":"10.1186/s13326-025-00323-8","DOIUrl":"10.1186/s13326-025-00323-8","url":null,"abstract":"<p><strong>Background: </strong>There is a new framework from the United States government for screening synthetic nucleic acids. Beginning in October of 2026, it calls for the screening of sequences 50 nucleotides or greater in length that are known to contribute to pathogenicity or toxicity for humans, regardless of the taxa from which it originates. Distinguishing sequences that encode pathogenic and toxic functions from those that lack them is not simple.</p><p><strong>Objectives: </strong>Our project scope was to discern, describe, and catalog sequences involved in microbial pathogenesis from the scientific literature. We recognize a need for better terminology to designate pathogenic functions that are relevant across the entire range of existing parasites.</p><p><strong>Methods: </strong>We canvassed publications investigating microbial pathogens of humans, other animals, and some plants to collect thousands of sequences that enable the exploitation of hosts. We compared sequences to each other, grouping them according to what host biological processes they subvert and the consequence(s) for the host. We developed terms to capture many of the varied pathogenic functions for sequences employed by parasitic microbes for host exploitation and applied these terms in a systematic manner to our dataset of sequences.</p><p><strong>Results/conclusions: </strong>The enhanced and expanded terms enable a quick and pertinent evaluation of a sequence's ability to endow a microbe with pathogenic function when they are appropriately applied to relevant sequences. 
This will allow providers of synthetic nucleic acids to rapidly assess sequences ordered by their customers for pathogenic capacity. This will help fulfill the new US government guidance.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"16 1","pages":"4"},"PeriodicalIF":2.0,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11927349/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143669953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-20 | DOI: 10.1186/s13326-025-00328-3
Yiyuan Pu, Daniel Beck, Karin Verspoor
Background: In Literature-based Discovery (LBD), Swanson's original ABC model brought together isolated public knowledge statements and assembled them to infer putative hypotheses via logical connections. Modern LBD studies that scale up this approach through automation typically rely on a simple entity-based knowledge graph with co-occurrences and/or semantic triples as basic building blocks. However, our analysis of a knowledge graph constructed for a recent LBD system reveals limitations arising from such pairwise representations, which further negatively impact knowledge inference. Using LBD as the context and motivation in this work, we explore limitations of using pairwise relationships only as knowledge representation in knowledge graphs, and we identify impacts of these limitations on knowledge inference. We argue that enhanced knowledge representation is beneficial for biological knowledge representation in general, as well as for both the quality and the specificity of hypotheses proposed with LBD.
Results: Based on a systematic analysis of one co-occurrence-based LBD system focusing on Alzheimer's disease, we identify 7 types of limitations arising from the exclusive use of pairwise relationships in a standard knowledge graph (including the need to capture more than two entities interacting together in a single event) and 3 types of negative impacts on knowledge inferred with the graph: experimentally infeasible hypotheses, literature-inconsistent hypotheses, and oversimplified hypothesis explanations. We also present an indicative distribution of the different types of relationships. Pairwise relationships are an essential component of representation frameworks for knowledge discovery. However, only 20% of discoveries are perfectly represented with pairwise relationships alone; 73% require a combination of pairwise and nested relationships; the remaining 7% require pairwise relationships, nested relationships, and hypergraphs.
Conclusion: We argue that the standard entity pair-based knowledge graph, while essential for representing basic binary relations, results in important limitations for comprehensive biological knowledge representation and impacts downstream tasks such as proposing meaningful discoveries in LBD. These limitations can be mitigated by integrating more semantically complex knowledge representation strategies, including capturing collective interactions and allowing for nested entities. The use of more sophisticated knowledge representation will benefit biological fields with more expressive knowledge graphs. Downstream tasks, such as LBD, can benefit from richer representations as well, allowing for generation of implicit knowledge discoveries and explanations for disease diagnosis, treatment, and mechanism that are more biologically meaningful.
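The representational gap described above can be made concrete with a toy sketch: two independent triples cannot record that three entities participate in one joint event, whereas a hyperedge-style event can, and events can nest inside other events. The entity names below are illustrative, not drawn from the paper:

```python
from dataclasses import dataclass

# Pairwise triple: the standard knowledge-graph building block.
@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str

# Hyperedge/nested event: one relation over an arbitrary set of participants,
# where a participant may itself be another Event (nesting).
@dataclass(frozen=True)
class Event:
    relation: str
    participants: tuple  # strings or nested Event instances

# Pairwise-only: two separate triples lose the fact that drug_a and drug_b
# act *together* on the same target in a single event.
pairwise = [
    Triple("drug_a", "inhibits", "protein_x"),
    Triple("drug_b", "inhibits", "protein_x"),
]

# Hyperedge: one event with three participants preserves the joint interaction.
combination = Event("co-inhibits", ("drug_a", "drug_b", "protein_x"))

# Nesting: an event that takes another event as a participant.
finding = Event("suppresses", (combination, "amyloid_accumulation"))
```

The `finding` value is the kind of structure a flat triple store cannot express directly, which is the gap the abstract's "nested relationships" and "hypergraphs" address.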
{"title":"Enriched knowledge representation in biological fields: a case study of literature-based discovery in Alzheimer's disease.","authors":"Yiyuan Pu, Daniel Beck, Karin Verspoor","doi":"10.1186/s13326-025-00328-3","DOIUrl":"10.1186/s13326-025-00328-3","url":null,"abstract":"<p><strong>Background: </strong>In Literature-based Discovery (LBD), Swanson's original ABC model brought together isolated public knowledge statements and assembled them to infer putative hypotheses via logical connections. Modern LBD studies that scale up this approach through automation typically rely on a simple entity-based knowledge graph with co-occurrences and/or semantic triples as basic building blocks. However, our analysis of a knowledge graph constructed for a recent LBD system reveals limitations arising from such pairwise representations, which further negatively impact knowledge inference. Using LBD as the context and motivation in this work, we explore limitations of using pairwise relationships only as knowledge representation in knowledge graphs, and we identify impacts of these limitations on knowledge inference. We argue that enhanced knowledge representation is beneficial for biological knowledge representation in general, as well as for both the quality and the specificity of hypotheses proposed with LBD.</p><p><strong>Results: </strong>Based on a systematic analysis of one co-occurrence-based LBD system focusing on Alzheimer's Disease, we identify 7 types of limitations arising from the exclusive use of pairwise relationships in a standard knowledge graph-including the need to capture more than two entities interacting together in a single event-and 3 types of negative impacts on knowledge inferred with the graph-Experimentally infeasible hypotheses, Literature-inconsistent hypotheses, and Oversimplified hypotheses explanations. We also present an indicative distribution of different types of relationships. 
Pairwise relationships are an essential component in representation frameworks for knowledge discovery. However, only 20% of discoveries are perfectly represented with pairwise relationships alone. 73% require a combination of pairwise relationships and nested relationships. The remaining 7% are represented with pairwise relationships, nested relationships, and hypergraphs.</p><p><strong>Conclusion: </strong>We argue that the standard entity pair-based knowledge graph, while essential for representing basic binary relations, results in important limitations for comprehensive biological knowledge representation and impacts downstream tasks such as proposing meaningful discoveries in LBD. These limitations can be mitigated by integrating more semantically complex knowledge representation strategies, including capturing collective interactions and allowing for nested entities. The use of more sophisticated knowledge representation will benefit biological fields with more expressive knowledge graphs. Downstream tasks, such as LBD, can benefit from richer representations as well, allowing for generation of implicit knowledge discoveries and explanations for disease diagnosis, treatment, and mechanism that are more biologically meaningful.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"16 1","pages":"3"},"PeriodicalIF":2.0,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11924609/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143669945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-08 | DOI: 10.1186/s13326-025-00325-6
Rita T Sousa, Heiko Paulheim
Diabetes is a worldwide health issue affecting millions of people. Machine learning methods have shown promising results in improving diabetes prediction, particularly through the analysis of gene expression data. While gene expression data can provide valuable insights, challenges arise from the fact that the number of patients in expression datasets is usually limited, and data from different datasets covering different genes cannot be easily combined. This work proposes a novel approach that addresses these challenges by integrating multiple gene expression datasets and domain-specific knowledge using knowledge graphs (KGs), a unique tool for biomedical data integration, to learn uniform patient representations for subjects contained in different, incompatible datasets. Different strategies and KG embedding methods are explored to generate vector representations, which serve as inputs for a classifier. Extensive experiments demonstrate the efficacy of our approach, revealing weighted F1-score improvements in diabetes prediction of up to 13% when integrating multiple gene expression datasets and domain-specific knowledge about protein functions and interactions.
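As a rough sketch of the general idea (not the paper's actual method), a uniform patient representation can be obtained by expression-weighted averaging of gene-node embeddings from a knowledge graph, so that patients from datasets measuring different gene panels land in the same vector space. The embeddings and expression values below are toy data:

```python
# Expression-weighted average of gene-node embeddings: a simple way to map
# patients from incompatible datasets into one shared vector space.
def patient_vector(expression: dict[str, float],
                   gene_embeddings: dict[str, list[float]]) -> list[float]:
    dim = len(next(iter(gene_embeddings.values())))
    acc = [0.0] * dim
    total = 0.0
    for gene, level in expression.items():
        emb = gene_embeddings.get(gene)
        if emb is None:      # gene absent from the graph: skip it
            continue
        for i in range(dim):
            acc[i] += level * emb[i]
        total += level
    return [x / total for x in acc] if total else acc

# Toy 2-D embeddings; two patients measured on *different* gene panels
# still map into the same space and can feed one downstream classifier.
emb = {"INS": [1.0, 0.0], "TCF7L2": [0.0, 1.0], "GCK": [1.0, 1.0]}
p1 = patient_vector({"INS": 2.0, "TCF7L2": 2.0}, emb)  # -> [0.5, 0.5]
p2 = patient_vector({"GCK": 1.0}, emb)                 # -> [1.0, 1.0]
```

The resulting fixed-length vectors are what a standard classifier would consume, regardless of which dataset each patient came from.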
{"title":"Gene expression knowledge graph for patient representation and diabetes prediction.","authors":"Rita T Sousa, Heiko Paulheim","doi":"10.1186/s13326-025-00325-6","DOIUrl":"10.1186/s13326-025-00325-6","url":null,"abstract":"<p><p>Diabetes is a worldwide health issue affecting millions of people. Machine learning methods have shown promising results in improving diabetes prediction, particularly through the analysis of gene expression data. While gene expression data can provide valuable insights, challenges arise from the fact that the number of patients in expression datasets is usually limited, and the data from different datasets with different gene expressions cannot be easily combined. This work proposes a novel approach to address these challenges by integrating multiple gene expression datasets and domain-specific knowledge using knowledge graphs, a unique tool for biomedical data integration, and to learn uniform patient representations for subjects contained in different incompatible datasets. Different strategies and KG embedding methods are explored to generate vector representations, serving as inputs for a classifier. 
Extensive experiments demonstrate the efficacy of our approach, revealing weighted F1-score improvements in diabetes prediction up to 13% when integrating multiple gene expression datasets and domain-specific knowledge about protein functions and interactions.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"16 1","pages":"2"},"PeriodicalIF":2.0,"publicationDate":"2025-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11889825/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143585774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: TogoID (https://togoid.dbcls.jp/) is an identifier (ID) conversion service designed to link IDs across diverse categories of life science databases. With its ability to obtain IDs related through different semantic relationships, a user-friendly web interface, and a regular automatic data-update system, TogoID has been a valuable tool for bioinformatics.
Results: We have recently expanded TogoID's ability to represent semantics between datasets, enabling it to handle multiple semantic relationships within dataset pairs. TogoID can now distinguish relationships such as "glycans bind to proteins" or "glycans are processed by proteins" between glycans and proteins. Additional new features include the ability to display labels corresponding to database IDs, making it easier to interpret the relationships between the various IDs available in TogoID, and the ability to convert labels to IDs, extending the entry points for ID conversion. The implementation of URL parameters, which reproduce the state of TogoID's web application, allows users to share complex search results through a simple URL.
Conclusions: These advancements improve TogoID's utility in bioinformatics, allowing researchers to explore complex ID relationships. By introducing the tool's multi-semantic and label features, TogoID expands the concept of ID conversion and supports more comprehensive and efficient data integration across life science databases.
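The multi-semantic conversion idea can be illustrated with a minimal sketch; TogoID's actual data model and API are not reproduced here, and the IDs and relation names below are hypothetical:

```python
from collections import defaultdict

# Keying the mapping by (source ID, relation) rather than by source ID alone
# lets one dataset pair carry several distinct semantic relationships.
mappings = defaultdict(set)

def add_link(src_id: str, relation: str, dst_id: str) -> None:
    mappings[(src_id, relation)].add(dst_id)

def convert(src_id: str, relation: str) -> list[str]:
    """Return target IDs reachable from src_id via one semantic relation."""
    return sorted(mappings[(src_id, relation)])

# Hypothetical glycan and protein IDs: the same glycan maps to different
# proteins depending on which relationship is requested.
add_link("G00001", "binds_to", "P12345")
add_link("G00001", "processed_by", "P67890")

binds = convert("G00001", "binds_to")          # only the binding partner
processed = convert("G00001", "processed_by")  # only the processing enzyme
```

Without the relation in the key, both conversions would collapse into one undifferentiated ID list, which is exactly the ambiguity the multi-semantic feature removes.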
{"title":"Expanding the concept of ID conversion in TogoID by introducing multi-semantic and label features.","authors":"Shuya Ikeda, Kiyoko F Aoki-Kinoshita, Hirokazu Chiba, Susumu Goto, Masae Hosoda, Shuichi Kawashima, Jin-Dong Kim, Yuki Moriya, Tazro Ohta, Hiromasa Ono, Terue Takatsuki, Yasunori Yamamoto, Toshiaki Katayama","doi":"10.1186/s13326-024-00322-1","DOIUrl":"10.1186/s13326-024-00322-1","url":null,"abstract":"<p><strong>Background: </strong>TogoID ( https://togoid.dbcls.jp/ ) is an identifier (ID) conversion service designed to link IDs across diverse categories of life science databases. With its ability to obtain IDs related in different semantic relationships, a user-friendly web interface, and a regular automatic data update system, TogoID has been a valuable tool for bioinformatics.</p><p><strong>Results: </strong>We have recently expanded TogoID's ability to represent semantics between datasets, enabling it to handle multiple semantic relationships within dataset pairs. This enhancement enables TogoID to distinguish relationships such as \"glycans bind to proteins\" or \"glycans are processed by proteins\" between glycans and proteins. Additional new features include the ability to display labels corresponding to database IDs, making it easier to interpret the relationships between the various IDs available in TogoID, and the ability to convert labels to IDs, extending the entry point for ID conversion. The implementation of URL parameters, which reproduces the state of TogoID's web application, allows users to share complex search results through a simple URL.</p><p><strong>Conclusions: </strong>These advancements improve TogoID's utility in bioinformatics, allowing researchers to explore complex ID relationships. 
By introducing the tool's multi-semantic and label features, TogoID expands the concept of ID conversion and supports more comprehensive and efficient data integration across life science databases.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"16 1","pages":"1"},"PeriodicalIF":2.0,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11708180/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142948850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-28 | DOI: 10.1186/s13326-024-00321-2
Xiaofeng Liao, Thomas H A Ederveen, Anna Niehues, Casper de Visser, Junda Huang, Firdaws Badmus, Cenna Doornbos, Yuliia Orlova, Purva Kulkarni, K Joeri van der Velde, Morris A Swertz, Martin Brandt, Alain J van Gool, Peter A C 't Hoen
Motivation: We are witnessing enormous growth in the amount of molecular profiling (-omics) data. The integration of multi-omics data is challenging. Moreover, human multi-omics data may be privacy-sensitive and can be misused to de-anonymize and (re-)identify individuals. Hence, most biomedical data are kept in secure and protected silos. It therefore remains a challenge to re-use these data without infringing the privacy of the individuals from whom the data were derived. Federated analysis of Findable, Accessible, Interoperable, and Reusable (FAIR) data is a privacy-preserving solution for making optimal use of these multi-omics data and transforming them into actionable knowledge.
Results: The Netherlands X-omics Initiative is a National Roadmap Large-Scale Research Infrastructure aiming for efficient integration of data generated within X-omics and external datasets. To facilitate this, we developed the FAIR Data Cube (FDCube), which adopts and applies the FAIR principles, helping researchers create FAIR data and metadata, facilitating re-use of their data, and making their data-analysis workflows transparent, while ensuring data security and privacy.
{"title":"FAIR Data Cube, a FAIR data infrastructure for integrated multi-omics data analysis.","authors":"Xiaofeng Liao, Thomas H A Ederveen, Anna Niehues, Casper de Visser, Junda Huang, Firdaws Badmus, Cenna Doornbos, Yuliia Orlova, Purva Kulkarni, K Joeri van der Velde, Morris A Swertz, Martin Brandt, Alain J van Gool, Peter A C 't Hoen","doi":"10.1186/s13326-024-00321-2","DOIUrl":"10.1186/s13326-024-00321-2","url":null,"abstract":"<p><strong>Motivation: </strong>We are witnessing an enormous growth in the amount of molecular profiling (-omics) data. The integration of multi-omics data is challenging. Moreover, human multi-omics data may be privacy-sensitive and can be misused to de-anonymize and (re-)identify individuals. Hence, most biomedical data is kept in secure and protected silos. Therefore, it remains a challenge to re-use these data without infringing the privacy of the individuals from which the data were derived. Federated analysis of Findable, Accessible, Interoperable, and Reusable (FAIR) data is a privacy-preserving solution to make optimal use of these multi-omics data and transform them into actionable knowledge.</p><p><strong>Results: </strong>The Netherlands X-omics Initiative is a National Roadmap Large-Scale Research Infrastructure aiming for efficient integration of data generated within X-omics and external datasets. 
To facilitate this, we developed the FAIR Data Cube (FDCube), which adopts and applies the FAIR principles and helps researchers to create FAIR data and metadata, to facilitate re-use of their data, and to make their data analysis workflows transparent, and in the meantime ensure data security and privacy.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"15 1","pages":"20"},"PeriodicalIF":2.0,"publicationDate":"2024-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11681678/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142894524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-17 | DOI: 10.1186/s13326-024-00320-3
Sabrina Toro, Anna V Anagnostopoulos, Susan M Bello, Kai Blumberg, Rhiannon Cameron, Leigh Carmody, Alexander D Diehl, Damion M Dooley, William D Duncan, Petra Fey, Pascale Gaudet, Nomi L Harris, Marcin P Joachimiak, Leila Kiani, Tiago Lubiana, Monica C Munoz-Torres, Shawn O'Neil, David Osumi-Sutherland, Aleix Puig-Barbe, Justin T Reese, Leonore Reiser, Sofia Mc Robb, Troy Ruemping, James Seager, Eric Sid, Ray Stefancsik, Magalie Weber, Valerie Wood, Melissa A Haendel, Christopher J Mungall
Background: Ontologies are fundamental components of informatics infrastructure in domains such as biomedical, environmental, and food sciences, representing consensus knowledge in an accurate and computable form. However, their construction and maintenance demand substantial resources and necessitate substantial collaboration between domain experts, curators, and ontology experts. We present Dynamic Retrieval Augmented Generation of Ontologies using AI (DRAGON-AI), an ontology generation method employing Large Language Models (LLMs) and Retrieval Augmented Generation (RAG). DRAGON-AI can generate textual and logical ontology components, drawing from existing knowledge in multiple ontologies and unstructured text sources.
Results: We assessed the performance of DRAGON-AI on de novo term construction across ten diverse ontologies, using extensive manual evaluation of the results. Our method achieves high precision for relationship generation, though slightly lower than that of logic-based reasoning. It can also generate definitions deemed acceptable by expert evaluators, although these scored worse than human-authored definitions. Notably, evaluators with the highest confidence in a domain were better able to discern flaws in AI-generated definitions. We also demonstrated DRAGON-AI's ability to incorporate natural language instructions in the form of GitHub issues.
Conclusions: These findings suggest DRAGON-AI's potential to substantially aid the manual ontology construction process. However, our results also underscore the importance of having expert curators and ontology editors drive the ontology generation process.
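The retrieval-augmented step the abstract describes, seeding a language model prompt with the most similar existing ontology entries before generating a new term, can be sketched minimally as follows. The toy ontology, the word-overlap scoring, and the prompt template are illustrative assumptions, not the authors' DRAGON-AI implementation.

```python
# Sketch of retrieval-augmented prompt construction for ontology term
# generation. The toy "ontology" and overlap-based retrieval are
# illustrative only; DRAGON-AI itself uses richer retrieval and an LLM.

def tokenize(text):
    return set(text.lower().split())

def retrieve(query, ontology, k=2):
    """Rank existing term definitions by word overlap with the query."""
    scored = sorted(
        ontology.items(),
        key=lambda item: len(tokenize(item[1]) & tokenize(query)),
        reverse=True,
    )
    return scored[:k]

def build_prompt(new_term, ontology):
    """Assemble an LLM prompt seeded with the most similar definitions."""
    examples = retrieve(new_term, ontology)
    lines = [f"{term}: {definition}" for term, definition in examples]
    return (
        "Existing definitions:\n"
        + "\n".join(lines)
        + f"\n\nWrite a definition for: {new_term}"
    )

ontology = {
    "T cell": "A lymphocyte that matures in the thymus.",
    "B cell": "A lymphocyte that produces antibodies.",
    "neuron": "An electrically excitable cell of the nervous system.",
}
prompt = build_prompt("memory T cell lymphocyte", ontology)
```

In the full method, the retrieved definitions ground the model's output in the ontology's existing style and logic, which is what distinguishes RAG from unconditioned generation.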
Title: "Dynamic Retrieval Augmented Generation of Ontologies using Artificial Intelligence (DRAGON-AI)" (Journal of Biomedical Semantics 15(1):19; open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11484368/pdf/)
Pub Date: 2024-10-02 | DOI: 10.1186/s13326-024-00319-w
Houcemeddine Turki, Bonaventure F P Dossou, Chris Chinenye Emezue, Abraham Toluwase Owodunni, Mohamed Ali Hadj Taieb, Mohamed Ben Aouicha, Hanen Ben Hassen, Afif Masmoudi
Biomedical relation classification has been significantly improved by the application of advanced machine learning techniques to the raw texts of scholarly publications. Despite this improvement, the reliance on large chunks of raw text makes these algorithms suffer in terms of generalization, precision, and reliability. The distinctive characteristics of bibliographic metadata can prove effective in achieving better performance on this challenging task. In this research paper, we introduce an approach to biomedical relation classification using the qualifiers of co-occurring Medical Subject Headings (MeSH). First, we introduce MeSH2Matrix, our dataset of 46,469 biomedical relations curated from PubMed publications using our approach. Our dataset includes a matrix that maps associations between the qualifiers of subject MeSH keywords and those of object MeSH keywords. It also specifies the corresponding Wikidata relation type and the superclass of semantic relations for each relation. Using MeSH2Matrix, we build and train three machine learning models (a Support Vector Machine [SVM], a dense model [D-Model], and a convolutional neural network [C-Net]) to evaluate the efficiency of our approach to biomedical relation classification. Our best model achieves an accuracy of 70.78% for 195 classes and 83.09% for five superclasses. Finally, we provide a confusion matrix and extensive feature analyses to better examine the relationship between the MeSH qualifiers and the biomedical relations being classified. We hope our results will shed light on developing better algorithms for biomedical ontology classification based on the MeSH keywords of PubMed publications. For reproducibility, MeSH2Matrix and all our source code are publicly accessible at https://github.com/SisonkeBiotik-Africa/MeSH2Matrix .
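The qualifier-association matrix at the core of the approach, counting how often each subject-side MeSH qualifier co-occurs with each object-side qualifier for a relation, can be illustrated with a small sketch. The qualifier names and the example relation are hypothetical, not drawn from the actual MeSH2Matrix dataset.

```python
# Sketch of the qualifier co-occurrence matrix idea: for one relation,
# every (subject qualifier, object qualifier) pair increments one cell.
# Qualifier names below are hypothetical examples.
from collections import Counter
from itertools import product

def qualifier_matrix(subject_qualifiers, object_qualifiers):
    """Count co-occurrences of subject- and object-side MeSH qualifiers."""
    return Counter(product(subject_qualifiers, object_qualifiers))

# Hypothetical relation: a drug (subject) treats a disease (object).
matrix = qualifier_matrix(
    ["therapeutic use", "adverse effects"],
    ["drug therapy", "chemically induced"],
)
```

Flattened into a fixed-length vector (one slot per qualifier pair), such a matrix becomes the feature representation fed to the SVM, dense, and convolutional models described above.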
Title: "MeSH2Matrix: combining MeSH keywords and machine learning for biomedical relation classification based on PubMed" (Journal of Biomedical Semantics 15(1):18; open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11445994/pdf/)
Pub Date: 2024-09-15 | DOI: 10.1186/s13326-024-00316-z
Beata Fonferko-Shadrach, Huw Strafford, Carys Jones, Russell A. Khan, Sharon Brown, Jenny Edwards, Jonathan Hawken, Luke E. Shrimpton, Catharine P. White, Robert Powell, Inder M. S. Sawhney, William O. Pickrell, Arron S. Lacey
Natural language processing (NLP) is increasingly being used to extract structured information from unstructured text to assist clinical decision-making and aid healthcare research. The availability of expert-annotated documents for the development and validation of NLP applications is limited. We created synthetic clinical documents to address this and to validate the Extraction of Epilepsy Clinical Text version 2 (ExECTv2) NLP pipeline. We created 200 synthetic clinic letters based on hospital outpatient consultations with epilepsy specialists. The letters were double annotated by trained clinicians and researchers according to agreed guidelines. We used the annotation tool Markup with an epilepsy concept list based on the Unified Medical Language System ontology. All annotations were reviewed, and a gold standard set of annotations was agreed and used to validate the performance of ExECTv2. The overall inter-annotator agreement (IAA) between the two sets of annotations produced a per-item F1 score of 0.73. Validating ExECTv2 against the gold standard gave an overall F1 score of 0.87 per item and 0.90 per letter. The synthetic letters, annotations, and annotation guidelines have been made freely available. To our knowledge, this is the first publicly available set of annotated epilepsy clinic letters and guidelines that can be used by NLP researchers with minimal epilepsy knowledge. The IAA results show that clinical text annotation tasks are difficult and require a gold standard agreed by researcher consensus. ExECTv2, our automated epilepsy NLP pipeline, extracted detailed epilepsy information from unstructured epilepsy letters more accurately than human annotators, further confirming the utility of NLP for clinical and research applications.
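The per-item F1 scores reported above follow the standard precision/recall definition over two annotation sets. A minimal sketch, modelling each annotation as a (span, concept) pair, is given below; the concept labels and offsets are made-up examples, not the study's annotation scheme.

```python
# Per-item F1 between a gold-standard annotation set and another
# annotator's (or a pipeline's) output. Annotations are modelled as
# hashable (span, concept) pairs; the examples are hypothetical.

def f1_score(gold, predicted):
    """F1 over exact matches between two sets of annotations."""
    gold, predicted = set(gold), set(predicted)
    if not gold and not predicted:
        return 1.0  # two empty documents agree perfectly
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {
    (("seizure", 10, 17), "epilepsy:event"),
    (("lamotrigine", 40, 51), "epilepsy:drug"),
}
pred = {(("seizure", 10, 17), "epilepsy:event")}
# Here precision = 1.0 and recall = 0.5, so F1 is 2/3.
```

The same computation applied between two human annotators gives the IAA figure (0.73 above), and applied between the pipeline and the gold standard gives the validation scores (0.87 per item).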
Title: "Annotation of epilepsy clinic letters for natural language processing" (Journal of Biomedical Semantics, published 2024-09-15)