Pub Date: 2024-05-28; DOI: 10.1093/database/baae039
Charlotte Nachtegael, Jacopo De Stefani, Anthony Cnudde, Tom Lenaerts
While biomedical relation extraction (bioRE) datasets have been instrumental in the development of methods to support biocuration of single variants from texts, no datasets are currently available for the extraction of digenic or even oligogenic variant relations, despite reports in the literature that epistatic effects between combinations of variants in different loci (or genes) are important for understanding disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, intended for training tools that assist in the curation of scientific literature. To overcome the hurdles associated with the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation by helping to find the most informative subset of samples to label. Eighty-five full-text articles containing the relevant relations from the Oligogenic Diseases Database (OLIDA) were pre-annotated with PubTator, and text fragments featuring potential digenic variant combinations, i.e. gene-variant-gene-variant, were extracted. These text fragments were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, was used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, resulting in a dataset of 8442 fragments, 794 of them positive instances, covering 95% of the original annotated articles. When applied to gene-variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, a significant improvement over the non-fine-tuned model that underlines the relevance of the DUVEL dataset. This study shows how AL may play an important role in the creation of bioRE datasets relevant for biomedical curation applications.
DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is made freely available for research on GitHub and Hugging Face. Database URL: https://huggingface.co/datasets/cnachteg/duvel or https://doi.org/10.57967/hf/1571.
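The abstract does not state which query strategy ALAMBIC applied, so the following is only a minimal sketch of the general active-learning idea of picking the most informative samples to label. It assumes least-confidence uncertainty sampling for a binary classifier; the function name `least_confident` and the scores are hypothetical and are not taken from DUVEL or ALAMBIC.

```python
def least_confident(probabilities, batch_size):
    """Return the indices of the samples whose predicted positive-class
    probability lies closest to 0.5, i.e. the most uncertain ones."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:batch_size]


# Hypothetical model scores for six unlabelled text fragments (made up).
scores = [0.02, 0.48, 0.91, 0.55, 0.10, 0.70]

# Ask the annotator to label the two fragments the model is least sure about.
to_label = least_confident(scores, batch_size=2)
print(to_label)
```

In a full loop the newly labelled fragments would be added to the training set, the model retrained, and the selection repeated. With 794 positives among 8442 fragments (about 9.4%), a strategy of this kind may help surface informative candidates that random sampling would rarely reach, although the abstract does not quantify that effect.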
Title: DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations. Journal: Database: The Journal of Biological Databases and Curation, volume 2024. Type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11131422/pdf/
Pub Date: 2024-05-27; DOI: 10.1093/database/baae041
Preeti Choudhary, Zukang Feng, John Berrisford, Henry Chao, Yasuyo Ikegawa, Ezra Peisach, Dennis W Piehl, James Smith, Ahsan Tanweer, Mihaly Varadi, John D Westbrook, Jasmine Y Young, Ardan Patwardhan, Kyle L Morris, Jeffrey C Hoch, Genji Kurisu, Sameer Velankar, Stephen K Burley
The Protein Data Bank (PDB) is the global repository for public-domain experimentally determined 3D biomolecular structural information. The archival nature of the PDB presents certain challenges pertaining to updating or adding associated annotations from trusted external biodata resources. While each Worldwide PDB (wwPDB) partner has made best efforts to provide up-to-date external annotations, accessing and integrating information from disparate wwPDB data centers can be an involved process. To address this issue, the wwPDB has established the PDB Next Generation (or NextGen) Archive, developed to centralize and streamline access to enriched structural annotations from wwPDB partners and trusted external sources. At present, the NextGen Archive provides mappings between experimentally determined 3D structures of proteins and UniProt amino acid sequences, domain annotations from Pfam, SCOP2 and CATH databases and intra-molecular connectivity information. Since launch, the PDB NextGen Archive has seen substantial user engagement with over 3.5 million data file downloads, ensuring researchers have access to accurate, up-to-date and easily accessible structural annotations. Database URL: http://www.wwpdb.org/ftp/pdb-nextgen-archive-site.
Title: PDB NextGen Archive: centralizing access to integrated annotations and enriched structural information by the Worldwide Protein Data Bank. Journal: Database: The Journal of Biological Databases and Curation, volume 2024. Type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11130521/pdf/
Pub Date: 2024-05-27; DOI: 10.1093/database/baae038
Yuanyuan Wang, Yexin Yang, Yi Liu, Chao Liu, Meng Xu, Miao Fang, Xidong Mu
Fish, being a crucial component of aquatic ecosystems, holds significant importance from both economic and ecological perspectives. However, the identification of fish at the species level remains challenging, and there is a lack of a taxonomically complete and comprehensive reference sequence database for fish. Therefore, we developed CoSFISH, an online fish database. Currently, the database contains 21 535 cytochrome oxidase I sequences and 1074 18S rRNA sequences of 21 589 species, belonging to 8 classes and 90 orders. We additionally incorporate online analysis tools to aid users in comparing, aligning and analyzing sequences, as well as designing primers. Users can upload their own data for analysis, in addition to using the data stored in the database directly. CoSFISH offers an extensive fish database and incorporates online analysis tools, making it a valuable resource for the study of fish diversity, phylogenetics and biological evolution. Database URL: http://210.22.121.250:8888/CoSFISH/home/indexPage.
Title: CoSFISH: a comprehensive reference database of COI and 18S rRNA barcodes for fish. Journal: Database: The Journal of Biological Databases and Curation, volume 2024. Type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11130519/pdf/
Multiple sclerosis (MS) is the most common inflammatory demyelinating disease of the central nervous system. 'Omics' technologies (genomics, transcriptomics, proteomics) and associated drug information have begun reshaping our understanding of MS. However, these data are scattered across numerous references, making them challenging to fully utilize. We manually mined and compiled these data in the Multiple Sclerosis Gene Database (MSGD), which we intend to keep updating. We screened 5485 publications to construct the current version of MSGD. MSGD comprises 6255 entries, including 3274 variant entries, 1175 RNA entries, 418 protein entries, 313 knockout entries, 612 drug entries and 463 high-throughput entries. Each entry contains detailed information, such as species, disease type, gene descriptions (such as official gene symbols) and original references. MSGD is freely accessible and provides a user-friendly web interface. Users can search for genes of interest, view their expression patterns and detailed information, manage gene sets and submit new MS-gene associations through the platform. The primary principle behind MSGD's design is to provide an exploratory platform that minimizes filtration and interpretation barriers while keeping the data highly accessible. This initiative is expected to significantly assist researchers in deciphering gene mechanisms and improving the prevention, diagnosis and treatment of MS. Database URL: http://bio-bigdata.hrbmu.edu.cn/MSGD.
Title: MSGD: a manually curated database of genomic, transcriptomic, proteomic and drug information for multiple sclerosis. Authors: Tao Wu, Yaopan Hou, Guanghao Xin, Jingyan Niu, Shanshan Peng, Fanfan Xu, Ying Li, Yuling Chen, Yifangfei Yu, Huixue Zhang, Xiaotong Kong, Yuze Cao, Shangwei Ning, Lihua Wang, Junwei Hao. Pub Date: 2024-05-24; DOI: 10.1093/database/baae037. Journal: Database: The Journal of Biological Databases and Curation, volume 2024. Type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11126313/pdf/
Natural products play a pivotal role in drug discovery. They are produced as secondary metabolites of organisms, and their richness, although significantly influenced by various environmental factors, is predominantly determined by the intrinsic genetics of a series of enzymatic reactions. To date, few natural product-related databases treat chemical content as a prominent property. To gain unique insights into the quantitative diversity of natural products, we have developed the first TerPenoids database embedded with Content information (TPCN), with features such as compound browsing, structural search, scaffold analysis, similarity analysis and data download. This database can be accessed through a web-based computational toolkit available at http://www.tpcn.pro/. Through meticulous manual searches and analysis of over 10 000 reference papers, the TPCN database has integrated 6383 terpenoids obtained from 1254 distinct plant species. The database includes isolation parts, comprehensive molecule structures, Chemical Abstracts Service (CAS) registry numbers and 7508 content descriptions. The TPCN database emphasizes both the qualitative and quantitative dimensions as invaluable phenotypic characteristics of natural products that have undergone genetic evolution. By serving as a selection criterion, the TPCN database facilitates the discovery of drug alternatives with high content and the selection of high-yield medicinal plant species or phylogenetic alternatives, thereby fostering sustainable, cost-effective and environmentally friendly drug discovery in pharmaceutical farming. Database URL: http://www.tpcn.pro/.
Title: A terpenoids database with the chemical content as a novel agronomic trait. Authors: Wenqian Li, Yinliang Chen, Ruofei Yang, Zilong Hu, Shaozhong Wei, Sheng Hu, Xinjun Xiong, Meijuan Wang, Ammar Lubeiny, Xiaohua Li, Minglei Feng, Shuang Dong, Xinlu Xie, Chao Nie, Jingyi Zhang, Yunhao Luo, Yichen Zhou, Ruodi Liu, Jinhai Pan, De-Xin Kong, Xuebo Hu. Pub Date: 2024-05-22; DOI: 10.1093/database/baae027. Journal: Database: The Journal of Biological Databases and Curation, volume 2024. Type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11110934/pdf/
Pub Date: 2024-05-22; DOI: 10.1093/database/baae036
Christophe Jenny, Valentin Guignon, Felip Manyer I Ballester, Max Ruas, Mathieu Rouard
The Musa Germplasm Information System (MGIS) stands as a pivotal database for managing global banana genetic resources information. In our latest effort, we have expanded MGIS to incorporate in situ observations. We thus incorporated more than 3000 in situ observations from 133 countries primarily sourced from iNaturalist, GBIF, Flickr, Pl@ntNet, Google Street view and expert curation of the literature. This addition provides a more comprehensive and detailed view of banana diversity and its distribution. Additional graphical interfaces, supported by new Drupal modules, were developed, allowing users to compare banana accessions and explore them based on various filters including taxonomy and geographic location. The integrated maps present a unified view, showcasing both in situ observations and the collecting locations of accessions held in germplasm collections. This enhancement not only broadens the scope of MGIS but also promotes a collaborative and open approach in documenting banana diversity, to allow more effective conservation and use of banana germplasm. Furthermore, this work documents a citizen-science approach that could be relevant for other communities. Database URL: https://www.crop-diversity.org/mgis/musa-in-situ.
Title: Collecting and managing in situ banana genetic resources information (Musa spp.) using online resources and citizen science. Journal: Database: The Journal of Biological Databases and Curation, volume 2024. Type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11110932/pdf/
Pub Date: 2024-05-15; DOI: 10.1093/database/baae034
Alberto García S, Mireia Costa, Alba García-Zarzoso, Oscar Pastor
Mutational hotspots are DNA regions with an abnormally high frequency of genetic variants. Identifying whether a variant is located in a mutational hotspot is critical for determining the variant's role in disorder predisposition, development, and treatment response. Despite their significance, current databases on mutational hotspots are limited to the oncology domain. However, identifying mutational hotspots is critical for any disorder in which genetics plays a role. This is true for the world's leading cause of death: cardiac disorders. In this work, we present CardioHotspots, a literature-based database of manually curated hotspots for cardiac diseases. This is the only database we know of that provides high-quality and easily accessible information about hotspots associated with cardiac disorders. CardioHotspots is publicly accessible via a web-based platform (https://genomics-hub.pros.dsic.upv.es:3099/). Database URL: https://genomics-hub.pros.dsic.upv.es:3099/.
Title: CardioHotspots: a database of mutational hotspots for cardiac disorders. Journal: Database: The Journal of Biological Databases and Curation, volume 2024. Type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11096770/pdf/
Breast cancer is notorious for its high mortality and heterogeneity, resulting in different therapeutic responses. Classical biomarkers have been identified and successfully commercially applied to predict the outcome of breast cancer patients. Accumulating biomarkers, including non-coding RNAs, have been reported as prognostic markers for breast cancer with the development of sequencing techniques. However, there are currently no databases dedicated to the curation and characterization of prognostic markers for breast cancer. Therefore, we constructed a curated database for prognostic markers of breast cancer (PMBC). PMBC consists of 1070 markers covering mRNAs, lncRNAs, miRNAs and circRNAs. These markers are enriched in various cancer- and epithelial-related functions including mitogen-activated protein kinases signaling. We mapped the prognostic markers into the ceRNA network from starBase. The lncRNA NEAT1 competes with 11 RNAs, including lncRNAs and mRNAs. The majority of the ceRNAs in ABAT belong to pseudogenes. The topology analysis of the ceRNA network reveals that known prognostic RNAs have higher closeness than random. Among all the biomarkers, prognostic lncRNAs have a higher degree, while prognostic mRNAs have significantly higher closeness than random RNAs. These results indicate that the lncRNAs play important roles in maintaining the interactions between lncRNAs and their ceRNAs, which might be used as a characteristic to prioritize prognostic lncRNAs based on the ceRNA network. PMBC renders a user-friendly interface and provides detailed information about individual prognostic markers, which will facilitate the precision treatment of breast cancer. PMBC is available at the following URL: http://www.pmbreastcancer.com/.
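As a rough illustration of the two topology measures named in this abstract, degree and closeness, the sketch below computes both on a toy undirected graph. The node names and edges are hypothetical placeholders and are not drawn from starBase or PMBC; closeness is taken here as the number of reachable nodes divided by the sum of shortest-path lengths, which is one common definition and may differ from the one the authors used.

```python
from collections import deque


def degree(graph, node):
    """Number of direct interaction partners of a node."""
    return len(graph[node])


def closeness(graph, node):
    """Reachable nodes divided by the sum of shortest-path lengths to them."""
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search gives shortest paths in unweighted graphs
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Toy ceRNA-style network: one hub RNA and four other RNAs (all hypothetical).
toy = {
    "lnc1": {"m1", "m2", "m3"},
    "m1": {"lnc1"},
    "m2": {"lnc1"},
    "m3": {"lnc1", "m4"},
    "m4": {"m3"},
}

for name in sorted(toy):
    print(name, degree(toy, name), round(closeness(toy, name), 3))
```

In this toy graph the hub `lnc1` has both the highest degree and the highest closeness, whereas the peripheral node `m4` scores lowest on both; the abstract's comparison against random RNAs would require repeating such measurements on randomly drawn node sets.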
Title: PMBC: a manually curated database for prognostic markers of breast cancer. Authors: Jiabei Liu, Yiyi Yu, Mingyue Li, Yixuan Wu, Weijun Chen, Guanru Liu, Lingxian Liu, Jiechun Lin, Chujun Peng, Weijun Sun, Xiaoli Wu, Xin Chen. Pub Date: 2024-05-15; DOI: 10.1093/database/baae033. Journal: Database: The Journal of Biological Databases and Curation, volume 2024. Type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11095525/pdf/
Pub Date : 2024-05-07 DOI: 10.1093/database/baae031
Marija Orlic-Milacic, Karen Rothfels, Lisa Matthews, Adam Wright, Bijay Jassal, Veronica Shamovsky, Quang Trinh, Marc E Gillespie, Cristoffer Sevilla, Krishna Tiwari, Eliot Ragueneau, Chuqiao Gong, Ralf Stephan, Bruce May, Robin Haw, Joel Weiser, Deidre Beavers, Patrick Conley, Henning Hermjakob, Lincoln D Stein, Peter D'Eustachio, Guanming Wu
Germline and somatic mutations can give rise to proteins with altered activity, including both gain- and loss-of-function. The effects of these variants can be captured in disease-specific reactions and pathways that highlight the resulting changes to normal biology. A disease reaction is defined as an aberrant reaction in which a variant protein participates. A disease pathway is defined as a pathway that contains a disease reaction. Annotation of disease variants as participants of disease reactions and disease pathways can provide a standardized overview of molecular phenotypes of pathogenic variants that is amenable to computational mining and mathematical modeling. Reactome (https://reactome.org/), an open-source, manually curated, peer-reviewed database of human biological pathways, provides annotations for >11 000 unique human proteins in the context of ∼15 000 wild-type reactions within more than 2000 wild-type pathways. It also provides annotations for >4000 disease variants of close to 400 genes as participants of ∼800 disease reactions in the context of ∼400 disease pathways. Functional annotation of disease variants proceeds from normal gene functions, described in wild-type reactions and pathways, through disease variants whose divergence from normal molecular behaviors has been experimentally verified, to extrapolation from molecular phenotypes of characterized variants to variants of unknown significance using criteria of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Reactome's data model enables mapping of disease variant datasets to specific disease reactions within disease pathways, providing a platform to infer pathway output impacts of numerous human disease variants and model organism orthologs, complementing computational predictions of variant pathogenicity. Database URL: https://reactome.org/.
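The two definitions given in this abstract (a disease reaction is an aberrant reaction with a variant protein participant; a disease pathway is a pathway containing a disease reaction) can be expressed as a small sketch. The class and field names below are hypothetical and chosen for illustration only; they are not Reactome's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Protein:
    name: str
    is_variant: bool = False  # True for a disease variant of a gene product


@dataclass
class Reaction:
    name: str
    participants: List[Protein] = field(default_factory=list)
    aberrant: bool = False  # True if the reaction departs from normal biology

    def is_disease_reaction(self) -> bool:
        # Aberrant reaction in which a variant protein participates.
        return self.aberrant and any(p.is_variant for p in self.participants)


@dataclass
class Pathway:
    name: str
    reactions: List[Reaction] = field(default_factory=list)

    def is_disease_pathway(self) -> bool:
        # Pathway that contains at least one disease reaction.
        return any(r.is_disease_reaction() for r in self.reactions)


# Hypothetical example objects (placeholder names, not real annotations).
wild_type = Protein("GENE_X wild type")
variant = Protein("GENE_X variant", is_variant=True)

normal_step = Reaction("normal signaling step", [wild_type])
aberrant_step = Reaction("aberrant signaling step", [variant], aberrant=True)

normal_pathway = Pathway("wild-type pathway", [normal_step])
disease_pathway = Pathway("disease pathway", [normal_step, aberrant_step])

print(normal_pathway.is_disease_pathway())   # False
print(disease_pathway.is_disease_pathway())  # True
```

Encoding the definitions as predicates over participants makes the pathway-level label follow mechanically from the reaction-level annotation, which is the property that makes such annotations amenable to computational mining.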
Pathway-based, reaction-specific annotation of disease variants for elucidation of molecular phenotypes. Database: The Journal of Biological Databases and Curation, vol. 2024. Published 2024-05-07. DOI: 10.1093/database/baae031. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11184451/pdf/
Pub Date : 2024-05-06 DOI: 10.1093/database/baae035
Correction to: ESKtides: a comprehensive database and mining method for ESKAPE phage-derived antimicrobial peptides. Database: The Journal of Biological Databases and Curation, vol. 2024. Published 2024-05-06. DOI: 10.1093/database/baae035. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11184445/pdf/