
Database: The Journal of Biological Databases and Curation: Latest Articles

Recognition and normalization of multilingual symptom entities using in-domain-adapted BERT models and classification layers.
IF 3.4 Zone 4 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-08-28 DOI: 10.1093/database/baae087
Fernando Gallego, Francisco J Veredas

Due to the scarcity of available annotations in the biomedical domain, clinical natural language processing poses a substantial challenge, especially when applied to low-resource languages. This paper presents our contributions for the detection and normalization of clinical entities corresponding to symptoms, signs, and findings present in multilingual clinical texts. For this purpose, the three subtasks proposed in the SympTEMIST shared task of the BioCreative VIII conference have been addressed. For Subtask 1 (named entity recognition in a Spanish corpus), an approach focused on BERT-based model assemblies pretrained on a proprietary oncology corpus was followed. Subtasks 2 and 3 of SympTEMIST address named entity linking (NEL) in Spanish and multilingual corpora, respectively. Our approach to these subtasks followed a classification strategy that starts from a bi-encoder trained by contrastive learning, for which several SapBERT-like models are explored. To apply this NEL approach to different languages, we have trained these models by leveraging the knowledge base of domain-specific medical concepts in Spanish supplied by the organizers, which we have translated into the other languages of interest by using machine translation tools. The results obtained in the three subtasks establish a new state of the art. For Subtask 1, we obtain a precision of 0.804, a recall of 0.699, and an F1-score of 0.748. For Subtask 2, we obtain performance gains of up to 5.5% in top-1 accuracy when the trained bi-encoder is followed by a WNT-softmax classification layer that is initialized with the mean of the embeddings of a subset of SNOMED-CT terms. For Subtask 3, the differences are even more pronounced, and our multilingual bi-encoder outperforms the other models analyzed in all languages except Swedish when combined with a WNT-softmax classification layer. The improvements in top-1 accuracy over the best bi-encoder model alone are 13% for Portuguese and 13.26% for Swedish. Database URL: https://doi.org/10.1093/database/baae087.
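The three Subtask 1 figures quoted in the abstract can be cross-checked against each other, since the F1-score is the harmonic mean of precision and recall. The sketch below only illustrates that arithmetic; the function name `f1_score` is ours and does not come from the paper's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for Subtask 1.
subtask1_f1 = f1_score(precision=0.804, recall=0.699)
print(round(subtask1_f1, 3))  # 0.748
```

The computed value agrees with the reported F1-score of 0.748 to three decimal places.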

Citations: 0
Integrating deep learning architectures for enhanced biomedical relation extraction: a pipeline approach.
IF 3.4 Zone 4 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-08-28 DOI: 10.1093/database/baae079
M Janina Sarol, Gibong Hong, Evan Guerra, Halil Kilicoglu

Biomedical relation extraction from scientific publications is a key task in biomedical natural language processing (NLP) and can facilitate the creation of large knowledge bases, enable more efficient knowledge discovery, and accelerate evidence synthesis. In this paper, building upon our previous effort in the BioCreative VIII BioRED Track, we propose an enhanced end-to-end pipeline approach for biomedical relation extraction (RE) and novelty detection (ND) that effectively leverages existing datasets and integrates state-of-the-art deep learning methods. Our pipeline consists of four tasks performed sequentially: named entity recognition (NER), entity linking (EL), RE, and ND. We trained models using the BioRED benchmark corpus that was the basis of the shared task. We explored several methods for each task and combinations thereof: for NER, we compared a BERT-based sequence labeling model that uses the BIO scheme with a span classification model. For EL, we trained a convolutional neural network model for diseases and chemicals and used an existing tool, PubTator 3.0, for mapping other entity types. For RE and ND, we adapted the BERT-based, sentence-bound PURE model to bidirectional and document-level extraction. We also performed extensive hyperparameter tuning to improve model performance. We obtained our best performance using BERT-based models for NER, RE, and ND, and the hybrid approach for EL. Our enhanced and optimized pipeline showed substantial improvement compared to our shared task submission, NER: 93.53 (+3.09), EL: 83.87 (+9.73), RE: 46.18 (+15.67), and ND: 38.86 (+14.9). While the performances of the NER and EL models are reasonably high, RE and ND tasks remain challenging at the document level. Further enhancements to the dataset could enable more accurate and useful models for practical use. We provide our models and code at https://github.com/janinaj/e2eBioMedRE/. Database URL: https://github.com/janinaj/e2eBioMedRE/.
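For readers unfamiliar with the BIO scheme mentioned for the NER step, the following is a minimal sketch of how per-token BIO tags are typically decoded into entity spans. It is not taken from the authors' repository: the function name `bio_to_spans`, the example sentence, and the labels `Chemical` and `Disease` are illustrative assumptions, and stray `I-` tags are handled leniently by starting a new span.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO tags into (start, end_exclusive, label) token spans."""
    spans: List[Tuple[int, int, str]] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if its label matches.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Stray I- tag without a preceding B-: start a new span (lenient choice).
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Hypothetical example sentence and labels.
tokens = ["Aspirin", "reduces", "heart", "attack", "risk"]
tags = ["B-Chemical", "O", "B-Disease", "I-Disease", "O"]
for start, end, label in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
# Chemical Aspirin
# Disease heart attack
```

A span classification model, the alternative compared in the abstract, would instead score candidate (start, end) pairs directly rather than tagging each token.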

Citations: 0
RiceMetaSys: Drought-miR, a one-stop solution for drought responsive miRNAs-mRNA module in rice.
IF 3.4 Zone 4 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-08-21 DOI: 10.1093/database/baae076
Deepesh Kumar, SureshKumar Venkadesan, Ratna Prabha, Shbana Begam, Bipratip Dutta, Dwijesh C Mishra, K K Chaturvedi, Girish Kumar Jha, Amolkumar U Solanke, Amitha Mithra Sevanthi

MicroRNAs (miRNAs) are key players in plant stress responses, and reports are available on the role of miRNAs in the drought stress response in rice. This work reports the development of a database, RiceMetaSys: Drought-miR, based on the meta-analysis of publicly available sRNA datasets. From 28 drought stress-specific sRNA datasets, we identified 216 drought-responsive miRNAs (DRMs). The major features of the database include genotype-, tissue- and miRNA ID-specific search options and comparison of genotypes to identify common miRNAs. Co-localization of the DRMs with the known quantitative trait loci (QTLs), i.e., meta-QTL regions governing drought tolerance in rice pertaining to different drought adaptive traits, narrowed this down to 37 promising DRMs. To identify the high-confidence target genes of DRMs under drought stress, degradome datasets and a web resource on drought-responsive genes (RiceMetaSys: DRG) were used. Out of the 216 unique DRMs, only 193 had targets under highly stringent parameters. Out of the 1081 target genes identified from degradome datasets, 730 showed differential expression under drought stress in at least one accession. To retrieve complete information on the target genes, the database has been linked with RiceMetaSys: DRG. Further, we updated the RiceMetaSys: DRGv1 developed earlier with the addition of DRGs identified from RNA-seq datasets from five rice genotypes. We also identified 759 putative novel miRNAs and their target genes employing stringent criteria. The novel miRNA search has all the search options available for known miRNAs and additionally gives information on their in silico validation features. Simple sequence repeat markers for both the miRNAs and their target genes have also been designed and made available in the database. Network analysis of the target genes identified 60 hub genes, which primarily act through the abscisic acid and jasmonic acid pathways. Co-localization of the hub genes with the meta-QTL regions governing drought tolerance narrowed this down to the 16 most promising DRGs. Database URL: http://14.139.229.201/RiceMetaSys_miRNA Updated database of RiceMetaSys URL: http://14.139.229.201/RiceMetaSysA/Drought/.
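The co-localization step described above (keeping only miRNAs or hub genes that fall within meta-QTL regions) amounts to an interval-overlap test on genomic coordinates. The sketch below shows one simple way to express it; the abstract does not say how the authors implemented it, and the function name `colocalized`, the feature names, and all coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

# (chromosome, start, end) with inclusive coordinates; an assumption for this sketch.
Interval = Tuple[str, int, int]


def colocalized(features: Dict[str, Interval], regions: List[Interval]) -> List[str]:
    """Return names of features that overlap at least one region on the same chromosome."""
    hits: List[str] = []
    for name, (chrom, start, end) in features.items():
        for r_chrom, r_start, r_end in regions:
            # Two closed intervals overlap when each starts before the other ends.
            if chrom == r_chrom and start <= r_end and r_start <= end:
                hits.append(name)
                break
    return hits


# Hypothetical miRNA loci and a single hypothetical meta-QTL region.
mirna_loci = {
    "miR_A": ("chr1", 1500, 1520),
    "miR_B": ("chr1", 9000, 9021),
    "miR_C": ("chr2", 1500, 1520),
}
meta_qtl_regions = [("chr1", 1000, 2000)]
print(colocalized(mirna_loci, meta_qtl_regions))  # ['miR_A']
```

For genome-scale inputs, a sorted sweep or an interval tree would avoid the nested loop, but the overlap condition stays the same.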

Citations: 0
Advancing microbiome research through standardized data and metadata collection: introducing the Microbiome Research Data Toolkit.
IF 3.4 Zone 4 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-08-21 DOI: 10.1093/database/baae062
Lyndon Zass, Lamech M Mwapagha, Adetola F Louis-Jacques, Imane Allali, Julius Mulindwa, Anmol Kiran, Mariem Hanachi, Oussama Souiai, Nicola Mulder, Ovokeraye H Oduaran

Microbiome research has made significant gains with the evolution of sequencing technologies. Ensuring comparability between studies and enhancing the findability, accessibility, interoperability and reproducibility of microbiome data are crucial for maximizing the value of this growing body of research. Addressing the challenges of standardized metadata reporting, collection and curation, the Microbiome Working Group of the Human Hereditary and Health in Africa (H3Africa) consortium aimed to develop a comprehensive solution. In this paper, we present the Microbiome Research Data Toolkit, a versatile tool designed to standardize microbiome research metadata, facilitate MIxS-MIMS and PhenX reporting, standardize prospective collection of participant biological and lifestyle data, and retrospectively harmonize such data. This toolkit enables past, present and future microbiome research endeavors to collaborate effectively, fostering novel collaborations and accelerating knowledge discovery in the field. Database URL: https://doi.org/10.25375/uct.24218999.v2.

Citations: 0
GMMID: genetically modified mice information database.
IF 3.4 Zone 4 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-08-19 DOI: 10.1093/database/baae078
Menglin Xu, Minghui Fang, Qiyang Chen, Wenjun Xiao, Zhixuan Xu, Bao Cai, Zhenyang Zhao, Tao Wang, Zhu Zhu, Yingshan Chen, Yue Zhu, Mingzhou Dai, Tiancheng Jiang, Xinyi Li, Siuwing Chun, Runhua Zhou, Yafei Li, Yueyue Gou, Jingjing He, Lin Luo, Linlin You, Xuan Jiang

Genetically engineered mouse models (GEMMs) are vital for elucidating gene function and disease mechanisms. An overwhelming number of GEMM lines have been generated, but endeavors to collect and organize the information of these GEMMs are seriously lagging behind. Only a few databases are developed for the information of current GEMMs, and these databases lack biological descriptions of allele compositions, which poses a challenge for nonexperts in mouse genetics to interpret the genetic information of these mice. Moreover, these databases usually do not provide information on human diseases related to the GEMM, which hinders the dissemination of the insights the GEMM provides as a human disease model. To address these issues, we developed an algorithm to annotate all the allele compositions that have been reported with Python programming and have developed the genetically modified mice information database (GMMID; http://www.gmmid.cn), a user-friendly database that integrates information on GEMMs and related diseases from various databases, including National Center for Biotechnology Information, Mouse Genome Informatics, Online Mendelian Inheritance in Man, International Mouse Phenotyping Consortium, and Jax lab. GMMID provides comprehensive genetic information on >70 055 alleles, 65 520 allele compositions, and ∼4000 diseases, along with biologically meaningful descriptions of alleles and allele combinations. Furthermore, it provides spatiotemporal visualization of anatomical tissues mentioned in these descriptions, shown alongside the allele compositions. Compared to existing mouse databases, GMMID considers the needs of researchers across different disciplines and presents obscure genetic information in an intuitive and easy-to-understand format. It facilitates users in obtaining complete genetic information more efficiently, making it an essential resource for cross-disciplinary researchers. Database URL: http://www.gmmid.cn.

Citations: 0
ncStem: a comprehensive resource of curated and predicted ncRNAs in cancer stemness.
IF 3.4 Zone 4 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-08-13 DOI: 10.1093/database/baae081
Hui Liu, Nan Zhang, Yijie Jia, Jun Wang, Aokun Ye, Siru Yang, Honghan Zhou, Yingli Lv, Chaohan Xu, Shuyuan Wang

Cancer stemness plays an important role in cancer initiation and progression, and is the major cause of tumor invasion, metastasis, recurrence, and poor prognosis. Non-coding RNAs (ncRNAs) are a class of RNA transcripts that generally cannot encode proteins and have been demonstrated to play a critical role in regulating cancer stemness. Here, we developed the ncStem database to record manually curated and predicted ncRNAs associated with cancer stemness. In total, ncStem contains 645 experimentally verified entries, including 159 long non-coding RNAs (lncRNAs), 254 microRNAs (miRNAs), 39 circular RNAs (circRNAs), and 5 other ncRNAs. The detailed information of each entry includes the ncRNA name, ncRNA identifier, disease, reference, expression direction, tissue, species, and so on. In addition, ncStem also provides computationally predicted cancer stemness-associated ncRNAs for 33 TCGA cancers, which were prioritized using the random walk with restart (RWR) algorithm based on regulatory and co-expression networks. The total predicted cancer stemness-associated ncRNAs included 11 132 lncRNAs and 972 miRNAs. Moreover, ncStem provides tools for functional enrichment analysis, survival analysis, and cell location interrogation for cancer stemness-associated ncRNAs. In summary, ncStem provides a platform to retrieve cancer stemness-associated ncRNAs, which may facilitate research on cancer stemness and offer potential targets for cancer treatment. Database URL: http://www.nidmarker-db.cn/ncStem/index.html.
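The random walk with restart (RWR) algorithm named in the abstract can be sketched in a few lines. This is the textbook iteration p(t+1) = (1 - r) W p(t) + r p(0) on a column-normalized adjacency matrix W, not the ncStem implementation: the restart probability, the toy path graph, and the function name are assumptions, and the abstract does not state the parameters or networks actually used.

```python
from typing import List


def random_walk_with_restart(
    adj: List[List[float]],
    seeds: List[int],
    restart: float = 0.5,
    tol: float = 1e-10,
    max_iter: int = 1000,
) -> List[float]:
    """Return steady-state visiting probabilities for a walker restarting at the seeds."""
    n = len(adj)
    # Column-normalize so each node spreads its probability evenly over its neighbors.
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    w = [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    # Initial distribution: uniform over the seed nodes.
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
               for i in range(n)]
        # Stop once the L1 change between iterations is below the tolerance.
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy undirected path graph 0 - 1 - 2 - 3 with node 0 as the only seed.
path_graph = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(path_graph, seeds=[0])
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 1, 2, 3]
```

Nodes closer to the seed receive higher scores, which is the property used to prioritize candidate ncRNAs by their network proximity to known stemness-associated nodes.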

Citations: 0
Functional implications of glycans and their curation: insights from the workshop held at the 16th Annual International Biocuration Conference in Padua, Italy.
IF 3.4 Zone 4 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-08-13 DOI: 10.1093/database/baae073
Karina Martinez, Jon Agirre, Yukie Akune, Kiyoko F Aoki-Kinoshita, Cecilia Arighi, Kristian B Axelsen, Evan Bolton, Emily Bordeleau, Nathan J Edwards, Elisa Fadda, Ten Feizi, Catherine Hayes, Callum M Ives, Hiren J Joshi, Khakurel Krishna Prasad, Sofia Kossida, Frederique Lisacek, Yan Liu, Thomas Lütteke, Junfeng Ma, Adnan Malik, Maria Martin, Akul Y Mehta, Sriram Neelamegham, Kalpana Panneerselvam, René Ranzinger, Sylvie Ricard-Blum, Gaoussou Sanou, Vijay Shanker, Paul D Thomas, Michael Tiemeyer, James Urban, Randi Vita, Jeet Vora, Yasunori Yamamoto, Raja Mazumder

Dynamic changes in protein glycosylation impact human health and disease progression. However, current resources that capture disease and phenotype information focus primarily on the macromolecules within the central dogma of molecular biology (DNA, RNA, proteins). To gain a better understanding of organisms, there is a need to capture the functional impact of glycans and glycosylation on biological processes. A workshop titled "Functional impact of glycans and their curation" was held in conjunction with the 16th Annual International Biocuration Conference to discuss ongoing worldwide activities related to glycan function curation. This workshop brought together subject matter experts, tool developers, and biocurators from over 20 projects and bioinformatics resources. Participants discussed four key topics for each of their resources: (i) how they curate glycan function-related data from publications and other sources, (ii) what type of data they would like to acquire, (iii) what data they currently have, and (iv) what standards they use. Their answers contributed input that provided a comprehensive overview of state-of-the-art glycan function curation and annotations. This report summarizes the outcome of discussions, including potential solutions and areas where curators, data wranglers, and text mining experts can collaborate to address current gaps in glycan and glycosylation annotations, leveraging each other's work to improve their respective resources and encourage impactful data sharing among resources. Database URL: https://wiki.glygen.org/Glycan_Function_Workshop_2023.

Citations: 0
The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop.
IF 3.4 Zone 4 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-08-09 DOI: 10.1093/database/baae071
Rezarta Islamaj, Chih-Hsuan Wei, Po-Ting Lai, Ling Luo, Cathleen Coss, Preeti Gokal Kochar, Nicholas Miliaras, Oleg Rodionov, Keiko Sekiya, Dorothy Trinh, Deborah Whitman, Zhiyong Lu

The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing participants with the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, genes/proteins, chemicals, cell lines, gene variants, and species, as well as the pairwise relationships between them: disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike most previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such, the entities are normalized to the corresponding concept identifiers of standardized vocabularies: diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to the Single Nucleotide Polymorphism Database. Finally, each annotated relationship is marked as 'novel' or not, depending on whether it is a novel finding or experimental verification in the publication in which it is expressed. This distinction helps differentiate novel findings from other relationships in the same text that provide known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and includes a set of 400 newly published articles to serve as the test data for the challenge. All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine, using the original annotation guidelines, where each article is doubly annotated in a three-round annotation process until full agreement is reached between all curators. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development. Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381.

Citations: 0
SoDCoD: a comprehensive database of Cu/Zn superoxide dismutase conformational diversity caused by ALS-linked gene mutations and other perturbations.
IF 3.4 Zone 4 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-08-08 DOI: 10.1093/database/baae064
Riko Tabuchi, Yurika Momozawa, Yuki Hayashi, Hisashi Noma, Hidenori Ichijo, Takao Fujisawa

A structural alteration in copper/zinc superoxide dismutase (SOD1) is one of the common features caused by amyotrophic lateral sclerosis (ALS)-linked mutations. Although a large number of SOD1 variants have been reported in ALS patients, the detailed structural properties of each variant are not well summarized. We present SoDCoD, a database of superoxide dismutase conformational diversity, collecting our comprehensive biochemical analyses of the structural changes in SOD1 caused by ALS-linked gene mutations and other perturbations. SoDCoD version 1.0 contains information about the properties of 188 types of SOD1 mutants, including structural changes and their binding to Derlin-1, as well as a set of genes contributing to the proteostasis of mutant-like wild-type SOD1. This database provides valuable insights into the diagnosis and treatment of ALS, particularly by targeting conformational alterations in SOD1. Database URL: https://fujisawagroup.github.io/SoDCoDweb/.

Citations: 0
The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII.
IF 3.4 Zone 4 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-08-08 DOI: 10.1093/database/baae069
Rezarta Islamaj, Po-Ting Lai, Chih-Hsuan Wei, Ling Luo, Tiago Almeida, Richard A A Jonker, Sofia I R Conceição, Diana F Sousa, Cong-Phuoc Phan, Jung-Hsien Chiang, Jiru Li, Dinghao Pan, Wilailack Meesawad, Richard Tzong-Han Tsai, M Janina Sarol, Gibong Hong, Airat Valiev, Elena Tutubalina, Shao-Man Lee, Yi-Yu Hsu, Mingjie Li, Karin Verspoor, Zhiyong Lu

The BioRED track at BioCreative VIII calls for a community effort to identify, semantically categorize, and highlight the novelty factor of the relationships between biomedical entities in unstructured text. Relation extraction is crucial for many biomedical natural language processing (NLP) applications, from drug discovery to custom medical solutions. The BioRED track simulates a real-world application of biomedical relationship extraction, and as such, considers multiple biomedical entity types, normalized to their specific corresponding database identifiers, as well as defines relationships between them in the documents. The challenge consisted of two subtasks: (i) in Subtask 1, participants were given the article text and human expert annotated entities, and were asked to extract the relation pairs, identify their semantic type and the novelty factor, and (ii) in Subtask 2, participants were given only the article text, and were asked to build an end-to-end system that could identify and categorize the relationships and their novelty. We received a total of 94 submissions from 14 teams worldwide. The highest F-scores achieved for Subtask 1 were: 77.17% for relation pair identification, 58.95% for relation type identification, 59.22% for novelty identification, and 44.55% when evaluating all of the above aspects of the comprehensive relation extraction. The highest F-scores achieved for Subtask 2 were: 55.84% for relation pair, 43.03% for relation type, 42.74% for novelty, and 32.75% for comprehensive relation extraction. The entire BioRED track dataset and other challenge materials are available at the database URLs below. Database URLs: https://ftp.ncbi.nlm.nih.gov/pub/lu/BC8-BioRED-track/, https://codalab.lisn.upsaclay.fr/competitions/13377, and https://codalab.lisn.upsaclay.fr/competitions/13378.
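The F-scores reported above are harmonic means of precision and recall over extracted relations. The following is a minimal sketch of how such a score can be computed for relation pairs, assuming exact matching on (document, entity, entity) triples with unordered entity pairs. The helper names `pair_key` and `precision_recall_f1` and all identifiers in the example are hypothetical, and the official BioRED evaluation script may differ in its matching and averaging rules.

```python
def pair_key(doc_id, entity_a, entity_b):
    """Build an order-independent key for a document-level relation pair."""
    first, second = sorted((entity_a, entity_b))
    return (doc_id, first, second)


def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 for two collections of keys."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: four gold pairs, three predictions, two of them correct.
gold_pairs = {
    pair_key("doc1", "E1", "E2"),
    pair_key("doc1", "E1", "E3"),
    pair_key("doc2", "E4", "E5"),
    pair_key("doc2", "E5", "E6"),
}
predicted_pairs = {
    pair_key("doc1", "E2", "E1"),  # same pair as gold, entities swapped
    pair_key("doc2", "E4", "E5"),
    pair_key("doc2", "E4", "E6"),  # not in gold
}
precision, recall, f1 = precision_recall_f1(predicted_pairs, gold_pairs)
```

Extending the key with a relation type or a novelty label gives the stricter settings the abstract describes. A prediction then counts only if every component matches, which is consistent with the lower scores reported for those settings.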

Citations: 0