Pub Date: 2024-10-01 DOI: 10.1093/database/baae110
Title: Correction to: The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII. Journal: Database: The Journal of Biological Databases and Curation, 2024.
Pub Date: 2024-09-27 DOI: 10.1093/database/baae093
Oluwamayowa O Amusat, Harshad Hegde, Christopher J Mungall, Anna Giannakou, Neil P Byers, Dan Gunter, Kjiersten Fagnan, Lavanya Ramakrishnan
Advanced omics technologies and facilities generate a wealth of valuable data daily; however, the data often lack the essential metadata required for researchers to find, curate, and search them effectively. The lack of metadata poses a significant challenge in the utilization of these data sets. Machine learning (ML)-based metadata extraction techniques have emerged as a potentially viable approach to automatically annotating scientific data sets with the metadata necessary for enabling effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming and not always feasible; thus, there is a need to develop automated text labeling techniques in order to accelerate the process of scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in terms of metadata curation and creation of gold-standard text mining data sets. In this paper, we present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Our techniques show the potential of two new ways to leverage existing information that is only available for select documents within a corpus to validate ML models, which can then be used to describe the remaining documents in the corpus. The first technique exploits relationships between different types of data sources related to the same research study, such as publications and proposals. The second technique takes advantage of domain-specific controlled vocabularies or ontologies. In this paper, we detail applying these approaches in the context of environmental genomics research for ML-generated metadata validation. 
Our results show that the proposed label assignment approaches can generate both generic and highly specific text labels for the unlabeled texts, with up to 44% of the labels matching those suggested by an ML keyword extraction algorithm.
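The vocabulary-based labeling idea and the label-match percentage reported above can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary terms, the example text, the ML keyword list, and the exact-match rule are all assumptions made for illustration, and the function names `assign_labels` and `match_fraction` are hypothetical.

```python
import re

def assign_labels(text, vocabulary):
    """Return the vocabulary terms that occur in the text as whole words or phrases."""
    lowered = text.lower()
    labels = set()
    for term in vocabulary:
        # Word boundaries keep "soil" from matching inside a longer word.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, lowered):
            labels.add(term.lower())
    return labels

def match_fraction(labels, ml_keywords):
    """Fraction of assigned labels that exactly match an ML-suggested keyword."""
    if not labels:
        return 0.0
    ml = {k.lower() for k in ml_keywords}
    return len(labels & ml) / len(labels)

# Hypothetical controlled vocabulary and document text (illustration only).
vocabulary = ["soil", "metagenome", "rhizosphere", "permafrost"]
text = "We sequenced the soil metagenome of rhizosphere samples from switchgrass plots."
labels = assign_labels(text, vocabulary)

# Hypothetical output of an ML keyword extractor for the same text.
ml_keywords = ["soil metagenome", "rhizosphere", "switchgrass"]
fraction = match_fraction(labels, ml_keywords)
```

Exact string matching is the strictest possible comparison; a real evaluation might also count partial or stemmed matches, which would raise the reported fraction.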
Title: Automated annotation of scientific texts for ML-based keyphrase extraction and validation. Journal: Database: The Journal of Biological Databases and Curation, 2024.
Personalized medicine tailors treatments and dosages based on a patient's unique characteristics, particularly their genetic profile. Over the decades, stratified research and clinical trials have uncovered crucial drug-related information (such as dosage, effectiveness, and side effects) affecting specific individuals with particular genetic backgrounds. This genetic-specific knowledge, characterized by complex multirelationships and conditions, cannot be adequately represented or stored in conventional knowledge systems. To address these challenges, we developed CPMKG, a condition-based platform that enables comprehensive knowledge representation. Through information extraction and meticulous curation, we compiled 307 614 knowledge entries, encompassing thousands of drugs, diseases, phenotypes (complications/side effects), genes, and genomic variations across four key categories: drug side effects, drug sensitivity, drug mechanisms, and drug indications. CPMKG facilitates drug-centric exploration and enables condition-based multiknowledge inference, accelerating knowledge discovery through three pivotal applications. To enhance user experience, we integrated a large language model that provides textual interpretations for each subgraph, bridging the gap between structured graphs and language expressions. With its comprehensive knowledge graph and user-centric applications, CPMKG serves as a valuable resource for clinical research, offering drug information tailored to personalized genetic profiles, syndromes, and phenotypes. Database URL: https://www.biosino.org/cpmkg/.
Title: CPMKG: a condition-based knowledge graph for precision medicine.
Authors: Jiaxin Yang, Xinhao Zhuang, Zhenqi Li, Gang Xiong, Ping Xu, Yunchao Ling, Guoqing Zhang
Pub Date: 2024-09-27 DOI: 10.1093/database/baae102
Journal: Database: The Journal of Biological Databases and Curation, 2024. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11429523/pdf/
PeptiHub (https://bioinformaticscollege.ir/peptihub/) is a meticulously curated repository of cancer-related peptides (CRPs) that have been documented in the scientific literature. PeptiHub includes a diverse collection of CRPs with a spectrum of effects and activities. While some peptides demonstrated significant anticancer efficacy, others exhibited no discernible impact, and some even possessed alternative non-drug functionalities, including drug carrier or carcinogenic attributes. Presently, PeptiHub houses 874 CRPs, evaluated across 10 distinct organism categories, 26 organs, and 438 cell lines. Each entry in the database is accompanied by easily accessible 3D conformations, obtained either experimentally or through predictive methodology. Users are provided with three search frameworks offering basic, advanced, and BLAST sequence search options. Furthermore, precise annotations of peptides enable users to explore CRPs based on their specific activities (anticancer, no effect, insignificant effect, carcinogen, and others) and their effectiveness (rate and IC50) under cancer conditions, specifically within individual organs. This unique property facilitates the construction of robust training and testing datasets. Additionally, PeptiHub offers 1141 features and lets users select the features most pertinent to their specific research questions. Features include aaindex1 (in six main subcategories: alpha propensities, beta propensity, composition indices, hydrophobicity, physicochemical properties, and other properties), amino acid composition (amino acid composition and dipeptide composition), and grouped amino acid composition (grouped amino acid composition, grouped dipeptide composition, and conjoint triad) categories. These utilities not only speed up machine learning-based peptide design but also facilitate peptide classification. Database URL: https://bioinformaticscollege.ir/peptihub/.
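Two of the feature families named above, amino acid composition and dipeptide composition, have standard definitions that can be sketched in a few lines. The sketch below follows those textbook definitions and is not PeptiHub's own code; the function names and the example sequence are hypothetical, and residues outside the 20 standard amino acids are assumed to count toward the sequence length but receive no feature of their own.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(seq):
    """Fraction of each standard residue in the peptide (20 features)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}

def dipeptide_composition(seq):
    """Fraction of each ordered residue pair among adjacent pairs (400 features)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must have at least two residues")
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return {a + b: pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)}

# Hypothetical toy peptide, chosen so the values are easy to check by hand.
aac = amino_acid_composition("ACAC")
dpc = dipeptide_composition("ACAC")
```

For the toy peptide `ACAC`, the composition vector puts 0.5 on A and 0.5 on C, and the three adjacent pairs (AC, CA, AC) give AC a weight of 2/3 and CA a weight of 1/3. Fixed-length vectors like these are what make peptides of different lengths comparable as machine learning inputs.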
Title: PeptiHub: a curated repository of precisely annotated cancer-related peptides with advanced utilities for peptide exploration and discovery.
Authors: Sara Zareei, Babak Khorsand, Alireza Dantism, Neda Zareei, Fereshteh Asgharzadeh, Shadi Shams Zahraee, Samane Mashreghi Kashan, Shirin Hekmatirad, Shila Amini, Fatemeh Ghasemi, Maryam Moradnia, Atena Vaghf, Anahid Hemmatpour, Hamdam Hourfar, Soudabeh Niknia, Ali Johari, Fatemeh Salimi, Neda Fariborzi, Zohreh Shojaei, Elaheh Asiaei, Hossein Shabani
Pub Date: 2024-09-20 DOI: 10.1093/database/baae092
Journal: Database: The Journal of Biological Databases and Curation, 2024. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11417155/pdf/
Pub Date: 2024-09-19 DOI: 10.1093/database/baae088
Neha, Jesu Castin, Saman Fatihi, Deepanshi Gahlot, Akanksha Arun, Lipi Thukral
The autophagy pathway plays a central role in cellular degradation. The proteins involved in the core autophagy process are mostly localised on membranes or interact indirectly with lipid-associated proteins. Therefore, progress in structure determination of 'core autophagy proteins' has remained relatively limited. A recent paradigm shift in structural biology, which includes cutting-edge cryo-EM technology and robust AI-based AlphaFold2 predicted models, has significantly increased the number of data points in biology. Here, we developed Autophagy3D, a web-based resource that provides an efficient way to access data associated with 40 core human autophagic proteins (80 322 structures), their protein-protein interactors, and ortholog structures from various species. Autophagy3D also offers detailed visualizations of protein structures, allowing users to derive direct biological insights. The database significantly enhances access to information, as full datasets are available for download. Autophagy3D is publicly accessible at https://autophagy3d.igib.res.in. Database URL: https://autophagy3d.igib.res.in.
Title: Autophagy3D: a comprehensive autophagy structure database. Journal: Database: The Journal of Biological Databases and Curation, 2024. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11412239/pdf/
Pub Date: 2024-09-19 DOI: 10.1093/database/baae091
Vincent C Calhoun, Eneida L Hatcher, Linda Yankie, Eric P Nawrocki
Tens of thousands of influenza sequences are deposited into the GenBank database each year. The software tool FLu ANnotation tool (FLAN) has been used by GenBank since 2007 to validate and annotate incoming influenza sequence submissions and has been publicly available as a webserver but not as a standalone tool. Viral Annotation DefineR (VADR) is a general sequence validation and annotation software package used by GenBank for norovirus, dengue virus and SARS-CoV-2 virus sequence processing that is available as a standalone tool. We have created VADR influenza models based on the FLAN reference sequences and adapted VADR to accurately annotate influenza sequences. VADR and FLAN show consistent results on the vast majority of influenza sequences, and when they disagree, VADR is usually correct. VADR can also accurately process influenza D sequences as well as influenza A H17, H18, H19, N10 and N11 subtype sequences, which FLAN cannot. VADR 1.6.3 and the associated influenza models are now freely available for users to download and use. Database URL: https://bitbucket.org/nawrockie/vadr-models-flu.
Title: Influenza sequence validation and annotation using VADR. Journal: Database: The Journal of Biological Databases and Curation, 2024. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11411204/pdf/
Pub Date: 2024-09-16 DOI: 10.1093/database/baae098
Yan Pan, Zijing Gao, Xuejian Cui, Zhen Li, Rui Jiang
Cell-cell communication (CCC) through ligand-receptor (L-R) pairs forms the cornerstone for complex functionalities in multicellular organisms. Deciphering such intercellular signaling can contribute to unraveling disease mechanisms and enable targeted therapy. Nonetheless, notable biases and inconsistencies are evident among the results generated by current methods for inferring CCC networks. To fill this gap, we developed collectNET (http://health.tsinghua.edu.cn/collectnet) as a comprehensive web platform for analyzing CCC networks, with efficient calculation, hierarchical browsing, comprehensive statistics, advanced searching, and intuitive visualization. collectNET provides a reliable online inference service with prior knowledge of three public L-R databases and systematic integration of three mainstream inference methods. Additionally, collectNET has assembled a human CCC atlas, including 126 785 significant communication pairs based on 343 023 cells. We anticipate that collectNET will benefit researchers in gaining a more holistic understanding of cell development and differentiation mechanisms. Database URL: http://health.tsinghua.edu.cn/collectnet.
Title: collectNET: a web server for integrated inference of cell-cell communication network. Journal: Database: The Journal of Biological Databases and Curation, 2024. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11403813/pdf/
Pub Date: 2024-09-16 DOI: 10.1093/database/baae101
Ander Martinez, Nuria García-Santa
This paper is a more in-depth analysis of the approaches used in our submission (Martínez A, García-Santa N. (2023) FRE @ BC8 SympTEMIST track: Named Entity Recognition. Zenodo.) to the 'SympTEMIST' Named Entity Recognition (NER) shared subtask at 'BioCreative 2023'. We participated in the challenge by submitting two systems based on a RoBERTa-architecture LLM trained on Spanish-language clinical data and available in the 'HuggingFace' model repository. Before choosing the systems to submit, we tried different combinations of the techniques described here: Conditional Random Fields and Byte-Pair Encoding dropout. In the second system we also included Sub-Subword feature based embeddings (SSW). The test set used in the challenge has now been released (López SL, Sánchez LG, Farré E et al. (2024) SympTEMIST Corpus: Gold Standard annotations for clinical symptoms, signs and findings information extraction. Zenodo), allowing us to analyze our methods in more depth and to measure the impact of introducing data from the CARMEN-I corpus (Lima-López S, Farré-Maduell E, Krallinger M. (2023) CARMEN-I: Clinical Entities Annotation Guidelines in Spanish. Zenodo). Our experiments show the moderate effect of using the Sub-Subword feature based embeddings and the impact of including the symptom NER data from the CARMEN-I dataset. Database URL: https://physionet.org/content/carmen-i/1.0/.
Title: An analysis of FRE @ BC8 SympTEMIST track: named entity recognition. Journal: Database: The Journal of Biological Databases and Curation, 2024. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11403810/pdf/
In the field of biomedical text mining, the ability to extract relations from the literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing biomedical relation extraction (RE) corpora, many of which concentrate on single-type relations at the sentence level. RegulaTome stands out by offering 16 961 relations annotated in >2500 documents, making it the most extensive dataset of its kind to date. This corpus is specifically designed to cover a broader spectrum of >40 relation types beyond those traditionally explored, setting a new benchmark in the complexity and depth of biomedical RE tasks. Our corpus both broadens the scope of detected relations and allows for achieving noteworthy accuracy in RE. A transformer-based model trained on this corpus has demonstrated a promising F1-score (66.6%) for a task of this complexity, underscoring the effectiveness of our approach in accurately identifying and categorizing a wide array of biological relations. This achievement highlights RegulaTome's potential to significantly contribute to the development of more sophisticated, efficient, and accurate RE systems to tackle biomedical tasks. Finally, a run of the trained RE system on all PubMed abstracts and PMC Open Access full-text documents resulted in >18 million relations, extracted from the entire biomedical literature.
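For readers unfamiliar with how an F1-score such as the 66.6% above is computed for relation extraction, the following is a minimal sketch of a micro-averaged F1 over relation tuples. It is not the authors' evaluation script: the tuple layout (document, source entity, relation type, target entity), the relation type names, and the example relations are all assumptions for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Micro-averaged precision, recall, and F1 over sets of relation tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical gold and predicted relations (illustration only).
gold = {
    ("doc1", "TP53", "Positive_regulation", "MDM2"),
    ("doc1", "ATM", "Phosphorylation", "TP53"),
    ("doc2", "EGF", "Binding", "EGFR"),
}
predicted = {
    ("doc1", "TP53", "Positive_regulation", "MDM2"),
    ("doc1", "ATM", "Phosphorylation", "TP53"),
    ("doc2", "EGFR", "Binding", "EGF"),  # reversed direction, so it counts as an error
}
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

Because the tuples are ordered, a prediction with the right entities and type but the wrong direction is both a false positive and a missed gold relation, which is one reason typed, directed extraction over many relation types tends to score lower than single-type tasks.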
{"title":"RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature.","authors":"Katerina Nastou, Farrokh Mehryary, Tomoko Ohta, Jouni Luoma, Sampo Pyysalo, Lars Juhl Jensen","doi":"10.1093/database/baae095","DOIUrl":"10.1093/database/baae095","url":null,"abstract":"<p><p>In the field of biomedical text mining, the ability to extract relations from the literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing biomedical relation extraction (RE) corpora, many of which concentrate on single-type relations at the sentence level. RegulaTome stands out by offering 16 961 relations annotated in >2500 documents, making it the most extensive dataset of its kind to date. This corpus is specifically designed to cover a broader spectrum of >40 relation types beyond those traditionally explored, setting a new benchmark in the complexity and depth of biomedical RE tasks. Our corpus both broadens the scope of detected relations and allows for achieving noteworthy accuracy in RE. A transformer-based model trained on this corpus has demonstrated a promising F1-score (66.6%) for a task of this complexity, underscoring the effectiveness of our approach in accurately identifying and categorizing a wide array of biological relations. This achievement highlights RegulaTome's potential to significantly contribute to the development of more sophisticated, efficient, and accurate RE systems to tackle biomedical tasks. 
Finally, a run of the trained RE system on all PubMed abstracts and PMC Open Access full-text documents resulted in >18 million relations, extracted from the entire biomedical literature.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11394941/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142281706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-11 DOI: 10.1093/database/baae090
Sylvia Vassileva, Georgi Grazhdanski, Ivan Koychev, Svetla Boytcheva
This paper presents a transformer-based approach for symptom Named Entity Recognition (NER) in Spanish clinical texts and multilingual entity linking on the SympTEMIST dataset. For Spanish NER, we fine-tune a RoBERTa-based token-level classifier with Bidirectional Long Short-Term Memory and conditional random field layers on an augmented training set, achieving an F1 score of 0.73. Entity linking is performed via a hybrid approach with dictionaries, generating candidates from a knowledge base containing Unified Medical Language System aliases using the cross-lingual SapBERT and reranking the top candidates using GPT-3.5. The entity linking approach shows consistent results across multiple languages, with an accuracy of 0.73 on the SympTEMIST multilingual dataset, and also achieves an accuracy of 0.6123 on the Spanish entity linking task, surpassing the current top score for this subtask. Database URL: https://github.com/svassileva/symptemist-multilingual-linking
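The two-stage entity linking idea in the abstract above (generate candidates by embedding similarity, then rerank the top few) can be illustrated with a minimal sketch. This is not the authors' implementation: the knowledge-base codes, aliases, and 3-dimensional vectors are made-up toy values standing in for SapBERT embeddings, and `rerank` takes a hypothetical pluggable scoring function where the paper uses GPT-3.5.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def generate_candidates(mention_vec, kb, k=3):
    """Return the k knowledge-base entries most similar to the mention.

    kb is a list of (code, alias, vector) tuples; the result is a list of
    (similarity, code, alias) tuples sorted by decreasing similarity.
    """
    scored = [(cosine(mention_vec, vec), code, alias) for code, alias, vec in kb]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def rerank(candidates, score_fn):
    """Reorder candidates by an external scorer (assumption: stands in for
    the language-model reranking step, which is not reproduced here)."""
    return sorted(candidates, key=score_fn, reverse=True)


# Toy knowledge base: hypothetical codes, aliases, and invented embeddings.
TOY_KB = [
    ("C001", "headache", [0.9, 0.1, 0.0]),
    ("C002", "fever", [0.1, 0.9, 0.0]),
    ("C003", "migraine", [0.8, 0.2, 0.1]),
    ("C004", "cough", [0.0, 0.2, 0.9]),
]

# Invented embedding for a mention such as "dolor de cabeza".
mention = [0.85, 0.15, 0.05]

top = generate_candidates(mention, TOY_KB, k=2)

# Hypothetical reranker that happens to prefer the alias "migraine".
reranked = rerank(top, lambda cand: 1.0 if cand[2] == "migraine" else 0.0)
```

Keeping candidate generation and reranking as separate functions mirrors the split described in the abstract: the cheap similarity search narrows the knowledge base to a few entries, so that a more expensive scorer only has to compare those.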
{"title":"Transformer-based approach for symptom recognition and multilingual linking","authors":"Sylvia Vassileva, Georgi Grazhdanski, Ivan Koychev, Svetla Boytcheva","doi":"10.1093/database/baae090","DOIUrl":"https://doi.org/10.1093/database/baae090","url":null,"abstract":"This paper presents a transformer-based approach for symptom Named Entity Recognition (NER) in Spanish clinical texts and multilingual entity linking on the SympTEMIST dataset. For Spanish NER, we fine tune a RoBERTa-based token-level classifier with Bidirectional Long Short-Term Memory and conditional random field layers on an augmented train set, achieving an F1 score of 0.73. Entity linking is performed via a hybrid approach with dictionaries, generating candidates from a knowledge base containing Unified Medical Language System aliases using the cross-lingual SapBERT and reranking the top candidates using GPT-3.5. The entity linking approach shows consistent results for multiple languages of 0.73 accuracy on the SympTEMIST multilingual dataset and also achieves an accuracy of 0.6123 on the Spanish entity linking task surpassing the current top score for this subtask. Database URL: https://github.com/svassileva/symptemist-multilingual-linking","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"69 1","pages":""},"PeriodicalIF":5.8,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142204871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}