Database: The Journal of Biological Databases and Curation: Latest Articles

A change language for ontologies and knowledge graphs.
IF 3.4, Zone 4 Biology, Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2025-01-22, DOI: 10.1093/database/baae133
Harshad Hegde, Jennifer Vendetti, Damien Goutte-Gattat, J Harry Caufield, John B Graybeal, Nomi L Harris, Naouel Karam, Christian Kindermann, Nicolas Matentzoglu, James A Overton, Mark A Musen, Christopher J Mungall

Ontologies and knowledge graphs (KGs) are general-purpose computable representations of some domain, such as human anatomy, and are frequently a crucial part of modern information systems. Most of these structures change over time, incorporating new knowledge or information that was previously missing. Managing these changes is a challenge, both in terms of communicating changes to users and providing mechanisms to make it easier for multiple stakeholders to contribute. To fill that need, we have created KGCL, the Knowledge Graph Change Language (https://github.com/INCATools/kgcl), a standard data model for describing changes to KGs and ontologies at a high level, and an accompanying human-readable Controlled Natural Language (CNL). This language serves two purposes: a curator can use it to request desired changes, and it can also be used to describe changes that have already happened, corresponding to the concepts of "apply patch" and "diff" commonly used for managing changes in text documents and computer programs. Another key feature of KGCL is that descriptions are at a high enough level to be useful to and understood by a variety of stakeholders; for example, ontology edits can be specified by commands like "add synonym 'arm' to 'forelimb'" or "move 'Parkinson disease' under 'neurodegenerative disease'." We have also built a suite of tools for managing ontology changes. These include an automated agent that integrates with and monitors GitHub ontology repositories and applies any requested changes, and a new component in the BioPortal ontology resource that allows users to make change requests directly from within the BioPortal user interface. Overall, the KGCL data model, its CNL, and associated tooling allow for easier management and processing of changes associated with the development of ontologies and KGs. Database URL: https://github.com/INCATools/kgcl.
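
To make the idea of a high-level change command concrete, here is a minimal Python sketch that parses the two example commands quoted in the abstract into structured change records. It is not the KGCL parser or data model: the `Change` class, its field names, and the regular expressions are hypothetical and cover only these two command shapes.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    """Hypothetical change record; not the actual KGCL data model."""
    kind: str     # "add_synonym" or "move"
    subject: str  # the term being edited
    value: str    # the new synonym, or the new parent term


# Assumed grammar for the two command shapes quoted in the abstract.
_PATTERNS = [
    ("add_synonym",
     re.compile(r"^add synonym '(?P<value>[^']+)' to '(?P<subject>[^']+)'$")),
    ("move",
     re.compile(r"^move '(?P<subject>[^']+)' under '(?P<value>[^']+)'$")),
]


def parse_command(text: str) -> Optional[Change]:
    """Return a Change for a recognized command, or None otherwise."""
    text = text.strip()
    for kind, pattern in _PATTERNS:
        match = pattern.match(text)
        if match:
            return Change(kind, match.group("subject"), match.group("value"))
    return None


request = parse_command("add synonym 'arm' to 'forelimb'")
```

The same record could in principle serve both roles the abstract describes: as a request to be applied (a patch) or as a description of an edit that has already happened (a diff).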

Citations: 0
Standardized pipelines support and facilitate integration of diverse datasets at the Rat Genome Database.
IF 3.4, Zone 4 Biology, Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2025-01-22, DOI: 10.1093/database/baae132
Jennifer R Smith, Marek A Tutaj, Jyothi Thota, Logan Lamers, Adam C Gibson, Akhilanand Kundurthi, Varun Reddy Gollapally, Kent C Brodie, Stacy Zacher, Stanley J F Laulederkind, G Thomas Hayman, Shur-Jen Wang, Monika Tutaj, Mary L Kaldunski, Mahima Vedi, Wendy M Demos, Jeffrey L De Pons, Melinda R Dwinell, Anne E Kwitek

The Rat Genome Database (RGD) is a multispecies knowledgebase which integrates genetic, multiomic, phenotypic, and disease data across 10 mammalian species. To support cross-species, multiomics studies and to enhance and expand on data manually extracted from the biomedical literature by the RGD team of expert curators, RGD imports and integrates data from multiple sources. These include major databases and a substantial number of domain-specific resources, as well as direct submissions by individual researchers. The incorporation of these diverse datatypes is handled by a growing list of automated import, export, data processing, and quality control pipelines. This article outlines the development over time of a standardized infrastructure for automated RGD pipelines with a summary of key design decisions and a focus on lessons learned.

Citations: 0
Correction to: The landscape of microRNA interaction annotation: analysis of three rare disorders as a case study.
IF 3.4, Zone 4 Biology, Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2025-01-13, DOI: 10.1093/database/baae131
Citations: 0
LSD600: the first corpus of biomedical abstracts annotated with lifestyle-disease relations.
IF 3.4, Zone 4 Biology, Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2025-01-13, DOI: 10.1093/database/baae129
Esmaeil Nourani, Evangelia-Mantelena Makri, Xiqing Mao, Sampo Pyysalo, Søren Brunak, Katerina Nastou, Lars Juhl Jensen

Lifestyle factors (LSFs) are increasingly recognized as instrumental in both the development and control of diseases. Despite their importance, there is a lack of methods to extract relations between LSFs and diseases from the literature, a step necessary to consolidate the currently available knowledge into a structured form. As simple co-occurrence-based relation extraction (RE) approaches are unable to distinguish between the different types of LSF-disease relations, context-aware models such as transformers are required to extract and classify these relations into specific relation types. However, no comprehensive LSF-disease RE system existed, nor a corpus suitable for developing one. We present LSD600 (available at https://zenodo.org/records/13952449), the first corpus specifically designed for LSF-disease RE, comprising 600 abstracts with 1900 relations of eight distinct types between 5027 diseases and 6930 LSF entities. We evaluated LSD600's quality by training a RoBERTa model on the corpus, achieving an F-score of 68.5% for the multilabel RE task on the held-out test set. We further validated LSD600 by using the trained model on the two Nutrition-Disease and FoodDisease datasets, where it achieved F-scores of 70.7% and 80.7%, respectively. Building on these performance results, LSD600 and the RE system trained on it can be valuable resources to fill the existing gap in this area and pave the way for downstream applications. Database URL: https://zenodo.org/records/13952449.
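
The F-scores quoted above combine precision and recall into a single number. The following Python sketch shows one common way to compute such a score for a multilabel task, using micro-averaging. The abstract does not say which averaging scheme the authors used, so that choice is an assumption, and the relation labels in the toy data are invented rather than taken from LSD600.

```python
from typing import List, Set


def micro_f_score(gold: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Micro-averaged F1 over examples, each carrying a set of labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # Pool true positives, false positives, and false negatives over all examples.
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    fp = sum(len(p - g) for g, p in zip(gold, predicted))
    fn = sum(len(g - p) for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data with invented relation labels (not the LSD600 label set).
gold = [{"causes"}, {"prevents", "treats"}, set()]
pred = [{"causes"}, {"prevents"}, {"causes"}]
score = micro_f_score(gold, pred)  # 2 TP, 1 FP, 1 FN
```

In this toy case precision and recall are both 2/3, so the F-score is also 2/3.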

Citations: 0
DisGeNet: a disease-centric interaction database among diseases and various associated genes.
IF 3.4, Zone 4 Biology, Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2025-01-11, DOI: 10.1093/database/baae122
Yaxuan Hu, Xingli Guo, Yao Yun, Liang Lu, Xiaotai Huang, Songwei Jia

The pathogenesis of complex diseases is intricately linked to various genes, and network medicine has enhanced our understanding of diseases. However, most network-based approaches ignore interactions mediated by noncoding RNAs (ncRNAs), and most databases focus only on the association between genes and diseases. To address these gaps, we have developed DisGeNet, a database that focuses not only on disease-associated genes but also on the interactions among genes. Here, the associations between diseases and various genes, as well as the interactions among these genes, are integrated into a disease-centric network. In total, DisGeNet stores 502 688 interactions/associations involving 6697 diseases, 5780 long noncoding RNAs (lncRNAs), 16 135 protein-coding genes, and 2610 microRNAs. These interactions/associations can be categorized as protein-protein, lncRNA-disease, microRNA-gene, microRNA-disease, gene-disease, and microRNA-lncRNA. Furthermore, when users search by the name or ID of a disease or gene, the matching interactions/associations can be browsed as a list or viewed in a local network view. Database URL: https://disgenet.cn/.
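
One way to picture the list and local-network views described above is as a lookup over typed edges. The Python sketch below is only an illustration under stated assumptions: the node identifiers and edges are placeholders rather than DisGeNet records, and `build_index` and `local_view` are hypothetical helper names, not part of any DisGeNet interface.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (node_a, node_b, category); placeholder entries, not DisGeNet records.
EDGES: List[Tuple[str, str, str]] = [
    ("disease:D1", "gene:G1", "gene-disease"),
    ("disease:D1", "lncRNA:L1", "lncRNA-disease"),
    ("gene:G1", "gene:G2", "protein-protein"),
    ("miRNA:M1", "gene:G1", "microRNA-gene"),
]


def build_index(edges: List[Tuple[str, str, str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Index every edge under both endpoints so either can be searched."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for a, b, category in edges:
        index[a].append((b, category))
        index[b].append((a, category))
    return index


def local_view(index: Dict[str, List[Tuple[str, str]]], node: str) -> List[Tuple[str, str]]:
    """Return the direct neighbours of a node with their edge categories."""
    return sorted(index.get(node, []))


neighbours = local_view(build_index(EDGES), "gene:G1")
```

The returned list corresponds to the list view; drawing the query node with these neighbours would give a one-hop local network.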

Citations: 0
HoloFood Data Portal: holo-omic datasets for analysing host-microbiota interactions in animal production.
IF 3.4, Zone 4 Biology, Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2025-01-11, DOI: 10.1093/database/baae112
Alexander B Rogers, Varsha Kale, Germana Baldi, Antton Alberdi, M Thomas P Gilbert, Dipayan Gupta, Morten T Limborg, Sen Li, Thomas Payne, Bent Petersen, Jacob A Rasmussen, Lorna Richardson, Robert D Finn

The HoloFood project used a hologenomic approach to understand the impact of host-microbiota interactions on salmon and chicken production by analysing multiomic data, phenotypic characteristics, and associated metadata in response to novel feeds. The project's raw data, derived analyses, and metadata are deposited in public, open archives (BioSamples, European Nucleotide Archive, MetaboLights, and MGnify), so making use of these diverse data types may require access to multiple resources. This is especially complex where analysis pipelines produce derived outputs such as functional profiles or genome catalogues. The HoloFood Data Portal is a web resource that simplifies access to the project datasets. For example, users can conveniently access multiomic datasets derived from the same individual or retrieve host phenotypic data with a linked gut microbiome sample. Project-specific metagenome-assembled genome and viral catalogues are also provided, linking to broader datasets in MGnify. The portal stores only data necessary to provide these relationships, with possible linking to the underlying repositories. The portal showcases a model approach for how future multiomics datasets can be made available. Database URL:  https://www.holofooddata.org.

Citations: 0
GeniePool 2.0: advancing variant analysis through CHM13-T2T, AlphaMissense, gnomAD V4 integration, and variant co-occurrence queries.
IF 3.4, Zone 4 Biology, Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2024-12-27, DOI: 10.1093/database/baae130
Grisha Weintraub, Noam Hadar, Ehud Gudes, Shlomi Dolev, Ohad S Birk

Originally developed to meet the challenges of the genomic data deluge, GeniePool emerged as a pioneering platform whose data lake architecture enables efficient storage, accessibility, and analysis of vast genomic datasets. Building on this foundation, GeniePool 2.0 advances genomic analysis through the integration of cutting-edge variant databases, such as CHM13-T2T, AlphaMissense, and gnomAD V4, coupled with the capability for variant co-occurrence queries. This evolution offers an unprecedented level of granularity and scope in genomic analyses, from enhancing our understanding of variant pathogenicity and phenotypic associations to facilitating research collaborations. The introduction of CHM13-T2T provides a more accurate reference for human genetic variation, AlphaMissense enriches the platform with protein-level impact predictions of missense mutations, and gnomAD V4 offers a comprehensive view of human genetic diversity. Additionally, the innovative feature for variant co-occurrence analysis is pivotal for exploring the combined effects of genetic variations, advancing our comprehension of compound heterozygosity, epistasis, and polygenic risk factors in disease pathogenesis. GeniePool 2.0 is a comprehensive and scalable platform, which aims to enhance genomic data analysis and contribute to genomic research, potentially supporting new discoveries and clinical innovations. Database URL: https://GeniePool.link.
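
A variant co-occurrence query asks which samples carry every variant in a given set. The Python sketch below illustrates that idea only; it says nothing about how GeniePool implements it. The function name, sample identifiers, and variant identifiers are all hypothetical.

```python
from typing import Dict, List, Set


def samples_with_all(variants_by_sample: Dict[str, Set[str]],
                     query: Set[str]) -> List[str]:
    """Return, in sorted order, the samples that carry every queried variant."""
    return sorted(
        sample
        for sample, variants in variants_by_sample.items()
        if query <= variants  # subset test: all queried variants are present
    )


# Placeholder identifiers, not real variants or samples.
cohort = {
    "sample_A": {"var1", "var2", "var3"},
    "sample_B": {"var1"},
    "sample_C": {"var2", "var1"},
}
carriers = samples_with_all(cohort, {"var1", "var2"})
```

Here `sample_A` and `sample_C` are returned because both carry `var1` and `var2`, while `sample_B` carries only one of them.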

Citations: 0
AneRBC dataset: a benchmark dataset for computer-aided anemia diagnosis using RBC images.
IF 3.4, Zone 4 Biology, Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2024-12-25, DOI: 10.1093/database/baae120
Muhammad Shahzad, Syed Hamad Shirazi, Muhammad Yaqoob, Zakir Khan, Assad Rasheed, Israr Ahmed Sheikh, Asad Hayat, Huiyu Zhou

Visual analysis of peripheral blood smear slides using medical image analysis is required to diagnose red blood cell (RBC) morphological deformities caused by anemia. The absence of a complete anaemic RBC dataset has hindered the training and testing of deep convolutional neural networks (CNNs) for computer-aided analysis of RBC morphology. We introduce a benchmark RBC image dataset named Anemic RBC (AneRBC) to overcome this problem. This dataset is divided into two versions: AneRBC-I and AneRBC-II. AneRBC-I contains 1000 microscopic images, including 500 healthy and 500 anaemic images with 1224 × 960 pixel resolution, along with manually generated ground truth of each image. Each image contains approximately 1550 RBC elements, including normocytes, microcytes, macrocytes, elliptocytes, and target cells, resulting in a total of approximately 1 550 000 RBC elements. The dataset also includes each image's complete blood count and morphology reports to validate the CNN model results with clinical data. The annotation, labeling, and ground truth for each image were generated under the supervision of a team of expert pathologists. Due to the high resolution, each image was divided into 12 subimages with ground truth and incorporated into AneRBC-II. AneRBC-II comprises a total of 12 000 images: 6000 healthy and 6000 anaemic RBC images. Four state-of-the-art CNN models were applied for segmentation and classification to validate the proposed dataset. Database URL: https://data.mendeley.com/preview/hms3sjzt7f?a=4d0ba42a-cc6f-4777-adc4-2552e80db22b.
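
The step from 1000 images to 12 000 subimages is simple tiling arithmetic, sketched below in Python. The abstract does not state how the 12 subimages are laid out, so the 4 × 3 grid (tiles of 306 × 320 pixels) is an assumption chosen because it divides 1224 × 960 evenly; `tile_boxes` is a hypothetical helper.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def tile_boxes(width: int, height: int, cols: int, rows: int) -> List[Box]:
    """Split an image area into a cols x rows grid of equal tiles, row by row."""
    if width % cols or height % rows:
        raise ValueError("image size must divide evenly into the grid")
    tile_w, tile_h = width // cols, height // rows
    return [
        (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
        for r in range(rows)
        for c in range(cols)
    ]


# Assumed 4 x 3 layout; the source only says "12 subimages".
boxes = tile_boxes(1224, 960, cols=4, rows=3)
total_subimages = 1000 * len(boxes)  # 1000 source images in AneRBC-I
```

Each box could be passed to an image library's crop routine, applied identically to an image and its ground-truth mask so the two stay aligned.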

引用次数: 0
MiCK: a database of gut microbial genes linked with chemoresistance in cancer patients.
IF 3.4 Zone 4 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-12-21 DOI: 10.1093/database/baae124
Muhammad Shahzaib, Muhammad Muaz, Muhammad Hasnain Zubair, Masood Ur Rehman Kayani

Cancer remains a global health challenge, with high morbidity and mortality. In 2020, cancer caused nearly 10 million deaths, making it the second leading cause of death worldwide. Chemoresistance has become a major hurdle in treating cancer patients successfully. Human gut microbes have recently been recognized for modulating drug efficacy through their metabolites, which can ultimately lead to chemoresistance. Currently available databases are limited to knowledge about interactions between the gut microbiome and drugs; a database of human gut microbial gene sequences and their effect on the efficacy of chemotherapy in cancer patients has not yet been developed. To address this gap, we present the Microbial Chemoresistance Knowledgebase (MiCK), a comprehensive database that catalogs microbial gene sequences associated with chemoresistance. MiCK contains 1.6 million sequences of 29 gene types linked to chemoresistance and drug metabolism, curated manually from recent literature and sequence databases. It supports downstream analysis through a user-friendly web interface for sequence search and download. MiCK aims to help researchers understand and mitigate chemoresistance in cancers. Database URL: https://microbialchemreskb.com/.

JTIS: enhancing biomedical document-level relation extraction through joint training with intermediate steps.
IF 3.4 Zone 4 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-12-19 DOI: 10.1093/database/baae125
Jiru Li, Dinghao Pan, Zhihao Yang, Yuanyuan Sun, Hongfei Lin, Jian Wang

Biomedical Relation Extraction (RE) is central to Biomedical Natural Language Processing and is crucial for various downstream applications. Existing RE challenges in the field of biology have primarily focused on intra-sentential analysis. However, with the rapid increase in the volume of literature and the complexity of relationships between biomedical entities, it often becomes necessary to consider multiple sentences to fully extract the relationship between a pair of entities. Current methods often fail to fully capture the complex semantic structures of information in documents, thereby affecting extraction accuracy. Therefore, unlike traditional RE methods that rely on sentence-level analysis and heuristic rules, our method focuses on extracting entity relationships from biomedical literature titles and abstracts and classifying relations that are novel findings. In our method, a multitask training approach is employed for fine-tuning a Pre-trained Language Model in the field of biology. Based on a broad spectrum of carefully designed tasks, our multitask method not only extracts relations of better quality due to more effective supervision but also achieves a more accurate classification of whether the entity pairs are novel findings. Moreover, by applying a model ensemble method, we further enhance our model's performance. The extensive experiments demonstrate that our method achieves significant performance improvements, i.e. surpassing the existing baseline by 3.94% in RE and 3.27% in Triplet Novel Typing in F1 score on BioRED, confirming its effectiveness in handling complex biomedical literature RE tasks. Database URL: https://codalab.lisn.upsaclay.fr/competitions/13377#learn_the_details-dataset.
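The abstract mentions a model ensemble and reports gains in F1 score without specifying either in detail. The sketch below illustrates one common reading of each: probability averaging across models, and F1 computed over sets of extracted relation triples. Both choices are assumptions for illustration rather than the JTIS implementation, and the labels, entity names, and function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (entity_1, entity_2, relation_type); a toy representation, not the BioRED schema
Relation = Tuple[str, str, str]


def average_ensemble(prob_sets: List[Dict[str, float]]) -> str:
    """Average per-label probabilities from several models and return the top label."""
    labels = prob_sets[0].keys()
    mean = {lab: sum(p[lab] for p in prob_sets) / len(prob_sets) for lab in labels}
    return max(mean, key=mean.get)


def f1_score(predicted: Set[Relation], gold: Set[Relation]) -> float:
    """Harmonic mean of precision and recall over exact-match relation triples."""
    tp = len(predicted & gold)  # triples that are both predicted and correct
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical outputs of three fine-tuned models for one entity pair.
model_probs = [
    {"related": 0.6, "unrelated": 0.4},
    {"related": 0.4, "unrelated": 0.6},
    {"related": 0.7, "unrelated": 0.3},
]
ensemble_label = average_ensemble(model_probs)

# Hypothetical predicted and gold triples for one document.
predicted = {("geneA", "diseaseX", "related"), ("geneB", "drugY", "related"),
             ("geneC", "diseaseZ", "related")}
gold = {("geneA", "diseaseX", "related"), ("geneB", "drugY", "related"),
        ("geneD", "diseaseZ", "related")}
score = f1_score(predicted, gold)
```

Averaging probabilities lets a confident model outweigh a marginal disagreement, whereas a plain majority vote would treat every model equally; which scheme the authors used is not stated in the abstract.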
