Pub Date: 2024-10-23; DOI: 10.1093/database/baae103
Ekin Soysal, Kirk Roberts
This manuscript presents PheNormGPT, a framework for extraction and normalization of key findings in clinical text. PheNormGPT leverages large language models to extract key findings and phenotypic data from unstructured clinical text and map them to Human Phenotype Ontology concepts. It utilizes OpenAI's GPT-3.5 Turbo and GPT-4 models with fine-tuning and few-shot learning strategies, including a novel strategy that selects custom-tailored few-shot examples for each request. PheNormGPT was evaluated in the BioCreative VIII Track 3: Genetic Phenotype Extraction from Dysmorphology Physical Examination Entries shared task. PheNormGPT achieved an F1 score of 0.82 for standard matching and 0.72 for exact matching, securing first place for this shared task.
Title: PheNormGPT: a framework for extraction and normalization of key medical findings. Journal: Database: The Journal of Biological Databases and Curation, 2024. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11498178/pdf/
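The F1 scores reported for PheNormGPT combine precision and recall into a single number. The following is a minimal sketch of that arithmetic, not the shared task's official scorer. It assumes, purely for illustration, that a lenient score compares only (document, HPO ID) pairs while a strict score also requires identical character spans; the annotations, spans, and the function name `precision_recall_f1` are hypothetical.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f1(
    gold: Iterable[Hashable], predicted: Iterable[Hashable]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for two collections of hashable items."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical annotations: (document id, character span, HPO id).
gold = [
    ("doc1", (0, 13), "HP:0000316"),
    ("doc1", (20, 32), "HP:0000252"),
    ("doc2", (5, 12), "HP:0001250"),
]
pred = [
    ("doc1", (0, 13), "HP:0000316"),   # same span, same concept
    ("doc1", (18, 32), "HP:0000252"),  # same concept, different span
    ("doc2", (5, 12), "HP:0001263"),   # same span, different concept
]

# Lenient comparison (assumed): drop the span, keep document and concept.
lenient = precision_recall_f1(
    [(doc, hpo) for doc, _, hpo in gold],
    [(doc, hpo) for doc, _, hpo in pred],
)
# Strict comparison (assumed): the full triple must be identical.
strict = precision_recall_f1(gold, pred)

print("lenient P/R/F1:", lenient)
print("strict  P/R/F1:", strict)
```

Under these assumptions the strict score can never exceed the lenient one, which matches the direction of the gap between the two reported figures.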
Pub Date: 2024-10-21; DOI: 10.1093/database/baae106
Shuo Xu, Yuefu Zhang, Liang Chen, Xin An
The ever-increasing volume of COVID-19-related articles presents a significant challenge for the manual curation and multilabel topic classification of LitCovid. To address this, this study develops a novel multilabel topic classification framework that considers both the correlation and imbalance of topic labels while empowering the pretrained model. With the help of this framework, the study seeks to answer the following question: Do full texts, MeSH (Medical Subject Headings), and biological entities of articles about COVID-19 encode more discriminative information than metadata (title, abstract, keyword, and journal name)? From extensive experiments on our enriched version of the BC7-LitCovid corpus and the Hallmarks of Cancer corpus, the following conclusions can be drawn. Our framework demonstrates superior performance and robustness. The metadata of scientific publications about COVID-19 carries valuable information for multilabel topic classification. Compared to biological entities, full texts and MeSH can further enhance the performance of our framework for multilabel topic classification, but the improvement is very limited. Database URL: https://github.com/pzczxs/Enriched-BC7-LitCovid.
Title: Is metadata of articles about COVID-19 enough for multilabel topic classification task? Journal: Database: The Journal of Biological Databases and Curation, 2024. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11492800/pdf/
Pub Date: 2024-10-19; DOI: 10.1093/database/baae105
Cao Hengchun, Guo Hui, Yang Weifei, Li Guiting, Ju Ming, Duan Yinghui, Tian Qiuzhen, Ma Qin, Feng Xiaoxu, Zhang Zhanyou, Zhang Haiyang, Miao Hongmei
Sesame (Sesamum indicum L., 2n = 26) is a crucial oilseed crop cultivated worldwide. The ancient evolutionary position of the Sesamum genus highlights its value for genomics and molecular genetics research among the angiosperms of other genera. However, Sesamum is considered a small orphan genus, with only a few genomic databases available for cultivated sesame to date. The urgent need to construct comprehensive, curated genome databases that include genus-specific gene resources for both cultivated and wild Sesamum species is being recognized. In response, we developed the Sesamum Genomics Database (SesamumGDB), a user-friendly genomic database that integrates extensive genomic resources from two cultivated sesame varieties (S. indicum) and seven wild Sesamum species, covering all three chromosome groups (2n = 26, 32, and 64). This database showcases a total of 352 471 genes, including 6026 related to lipid metabolism and 17 625 transcription factors within Sesamum. Equipped with an array of bioinformatics tools such as BLAST (basic local alignment search tool) and JBrowse (the JavaScript browser), SesamumGDB facilitates data downloading, screening, visualization, and analysis. As the first centralized Sesamum genome database, SesamumGDB offers extensive insights into the genomics and genetics of sesame, potentially enhancing the molecular breeding of sesame and other oilseed crops in the future. Database URL: http://www.sgbdb.com/sgdb/.
Title: SesamumGDB: a comprehensive platform for Sesamum genetics and genomics analysis. Journal: Database: The Journal of Biological Databases and Curation, 2024. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11490215/pdf/
Pub Date: 2024-10-18; DOI: 10.1093/database/baae100
Abhay Deep Pandey, Ghanshyam Sharma, Anshula Sharma, Sudhanshu Vrati, Deepak T Nair
Many drug discovery exercises fail because small molecules that are effective inhibitors of target proteins exhibit high cellular toxicity. Early and effective assessment of toxicity and pharmacokinetics is essential to accelerate the drug discovery process. Conventional methods for toxicity profiling, including in vitro and in vivo assays, are laborious and resource-intensive. In response, we introduce the Small Molecule Cell Viability Database (SMCVdb), a comprehensive resource containing toxicity data for over 24 000 compounds obtained through high-content imaging (HCI). SMCVdb integrates chemical descriptions and molecular weight data with the toxicity data, offering researchers a single platform that aids compound prioritization and selection based on biological and economic considerations. Data collection for SMCVdb involved a systematic approach combining HCI toxicity profiling with chemical information, and quality control measures ensured data accuracy and consistency. The user-friendly web interface of SMCVdb provides multiple search and filter options, allowing users to query the database based on compound name, molecular weight range, or viability percentage. SMCVdb gives users access to toxicity profiles, molecular weights, compound names, and chemical descriptions, facilitating the exploration of relationships between compound properties and their effects on cell viability. In summary, the database provides experimentally derived cellular toxicity information for over 24 000 drug candidate molecules to academic researchers and pharmaceutical companies. The SMCVdb will keep growing and will prove to be a pivotal resource to expedite research in drug discovery and compound evaluation. Database URL: http://smcvdb.rcb.ac.in:4321/.
Title: SMCVdb: a database of experimental cellular toxicity information for drug candidate molecules. Journal: Database: The Journal of Biological Databases and Curation, 2024. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11488516/pdf/
Pub Date: 2024-10-16; DOI: 10.1093/database/baae109
Nico Cillari, Giuseppe Neri, Nadia Pisanti, Paolo Milazzo, Ugo Borello
Rett syndrome (RTT) is a neurodevelopmental disorder occurring almost exclusively in females and leading to a variety of impairments and disabilities ranging from mild to severe. In >95% of cases, RTT is due to mutations in the X-linked gene MECP2, but the molecular mechanisms determining RTT are unknown at present, and the complexity of the system is challenging. To facilitate and guide the unraveling of those mechanisms, we developed a database resource for the visualization and analysis of the genomic landscape in the context of the wild-type or mutated Mecp2 gene in the mouse model. Our resource allows for the exploration of differential dynamics of gene expression and the prediction of new potential MECP2 target genes to decipher the molecular mechanisms of RTT. Database URL: https://biomedinfo.di.unipi.it/rett-database/.
Title: RettDb: the Rett syndrome omics database to navigate the Rett syndrome genomic landscape. Journal: Database: The Journal of Biological Databases and Curation, 2024. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11482253/pdf/
Pub Date: 2024-10-12; DOI: 10.1093/database/baae096
Farha Anwer, Ahmad Navid, Fiza Faiz, Uzair Haider, Samavi Nasir, Muhammad Farooq, Maryam Zahra, Anosh Bano, Hafiza Hira Bashir, Madiha Ahmad, Syeda Aleena Abbas, Shah E Room, Muhammad Tariq Saeed, Amjad Ali
Acinetobacter baumannii has emerged as a prominent nosocomial pathogen, exhibiting a progressive rise in resistance to therapeutic interventions. This rise in resistance calls for alternative strategies. Here, we propose an alternative yet specialized resource on antimicrobial peptides (AMPs) against A. baumannii. The database, AbAMPdb, is a manually curated collection of 300 entries containing 250 experimental AMP sequences and 50 corresponding synthetic or mutated AMP sequences. The mutated sequences were modified with reported amino acid substitutions intended to decrease toxicity and increase antimicrobial potency. AbAMPdb also provides 3D models of all 300 AMPs, comprising 250 natural and 50 synthetic or mutated AMPs. Moreover, the database offers docked complexes comprising 5000 AMPs and their corresponding A. baumannii target proteins. These complexes, accessible in Protein Data Bank format, enable the 2D visualization of the interacting amino acid residues. We are confident that this comprehensive resource furnishes vital information concerning AMPs, encompassing their docking interactions with virulence factors and antibiotic resistance proteins of A. baumannii. To enhance clinical relevance, the characterized AMPs could undergo further investigation both in vitro and in vivo. Database URL: https://abampdb.mgbio.tech/.
Title: AbAMPdb: a database of Acinetobacter baumannii specific antimicrobial peptides. Journal: Database: The Journal of Biological Databases and Curation, 2024. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11470754/pdf/
Pub Date: 2024-10-12; DOI: 10.1093/database/baae111
Hyojung Paik, Chunryong Oh, Sajid Hussain, Sangjae Seo, Soon Woo Park, Tae Lyun Ko, Ari Lee
The development of therapeutic agents has mainly focused on designing small molecules to modulate target proteins or genes that are conventionally druggable. Targeted protein degradation (TPD) has therefore emerged as a promising pharmaceutical approach for undruggable cases. TPD, often referred to as PROTACs (PROteolysis TArgeting Chimeras), uses a linker to degrade target proteins by hijacking the ubiquitination system. Unraveling the relationships, including reversal and co-expression, between E3 ligases and other possible target genes in various human tissues is therefore essential to mitigate off-target effects of TPD. Here, we developed the atlas of E3 ligases in human tissues (ELiAH) to prioritize E3 ligase-target gene pairs for TPD. Leveraging over 2900 RNA-seq profiles from 11 human tissues provided by the GTEx (genotype-tissue expression) consortium, users of ELiAH can identify tissue-specific genes and E3 ligases (FDR P-value of Mann-Whitney test < .05). ELiAH unravels 933 830 relationships between 614 E3 ligases and 20 924 expressed genes, taking the degree of tissue specificity into account; these relationships are indispensable for ubiquitination-based TPD development. In addition, docking properties of those relationships are modeled using RosettaDock. ELiAH thus presents a comprehensive repertoire of E3 ligases for ubiquitination-based TPD drug development that avoids off-target effects. Database URL: https://eliahdb.org.
Title: ELiAH: the atlas of E3 ligases in human tissues for targeted protein degradation with reduced off-target effect. Journal: Database: The Journal of Biological Databases and Curation, 2024. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11470751/pdf/
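The ELiAH abstract reports significance as an FDR-adjusted P-value of a Mann-Whitney test. The abstract does not say which FDR procedure was used; the sketch below assumes the common Benjamini-Hochberg adjustment and applies it to made-up raw P-values. In practice the raw values could come from a routine such as `scipy.stats.mannwhitneyu`, which is not called here so that the example runs with the standard library only. The function name `benjamini_hochberg` and the input values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up raw P-values, e.g. one per gene tested in a tissue.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adjusted) if q < 0.05]

print("adjusted:", adjusted)
print("indices passing FDR < .05:", significant)
```

With these inputs only the first test survives the threshold, even though three raw P-values are below .05, which is the point of correcting for multiple comparisons across thousands of genes.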
Pub Date: 2024-10-12; DOI: 10.1093/database/baae108
Nicolas Haas, Julie Dawn Thompson, Jean-Paul Renaud, Kirsley Chennen, Olivier Poch
Nonsense variations, characterized by premature termination codons, play a major role in human genetic diseases as well as in cancer susceptibility. Despite their high prevalence, effective therapeutic strategies targeting premature termination codons remain a challenge. To understand and explore the intricate mechanisms involved, we developed StopKB, a comprehensive knowledgebase aggregating data from multiple sources on nonsense variations, associated genes, diseases, and phenotypes. StopKB identifies 637 317 unique nonsense variations, distributed across 18 022 human genes and linked to 3206 diseases and 7765 phenotypes. Notably, ∼32% of these variations are classified as nonsense-mediated mRNA decay-insensitive, potentially representing suitable targets for nonsense suppression therapies. We also provide an interactive web interface to facilitate efficient and intuitive data exploration, enabling researchers and clinicians to navigate the complex landscape of nonsense variations. StopKB represents a valuable resource for advancing research in precision medicine and, more specifically, for the development of targeted therapeutic interventions for genetic diseases associated with nonsense variations. Database URL: https://lbgi.fr/stopkb/.
Title: StopKB: a comprehensive knowledgebase for nonsense suppression therapies. Journal: Database: The Journal of Biological Databases and Curation, 2024. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11470752/pdf/
Pub Date: 2024-10-10; DOI: 10.1093/database/baae107
Petr Novotný, Jan Wild
The unifying element of all biodiversity data is the issue of taxon hierarchy modeling. We compared 25 existing databases in terms of how they handle the taxon hierarchy and present these data. We used documentation or demo installations of the databases as the primary source of information, followed by analysis of the structures using R packages provided by the inspected platforms. If neither of these was available, we used the public interface of the individual databases. For almost half (12) of the databases analyzed, we did not find any formalized data structure for the taxon hierarchy; these databases provide only biological information about taxon membership in higher ranks, which is not fully formalizable and thus not generally usable. The least effective model, the Adjacency List (storing the parentId of a taxon), dominates among the remaining providers. This study demonstrates the lack of attention paid by current biodiversity databases to modeling the taxon hierarchy, particularly to making it available to researchers in the form of a hierarchical data structure within the data provided. For biodiversity relational databases, the Closure Table is the most suitable of the known data models, and it also corresponds to the ontology concept. However, its use is rather sporadic within the biodiversity databases ecosystem.
Title: The relational modeling of hierarchical data in biodiversity databases. Journal: Database: The Journal of Biological Databases and Curation, 2024. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11466226/pdf/
Pub Date : 2024-10-09 DOI: 10.1093/database/baae104
Cong-Phuoc Phan, Ben Phan, Jung-Hsien Chiang
Despite numerous research efforts by teams participating in BioCreative VIII Track 01, which employed various techniques to achieve high accuracy on biomedical relation tasks, overall performance in this area still has substantial room for improvement. Large language models offer a new opportunity to improve the performance of existing techniques in natural language processing tasks. This paper presents our improved method for relation extraction, which integrates two well-known large language models: Gemini and GPT-4. Our approach uses GPT-4 to generate augmented training data, followed by an ensemble learning technique that combines the outputs of diverse models to produce a more precise prediction. We then use Gemini responses as input to fine-tune the BioNLP-PubMed-Bert classification model, which improves precision, recall, and F1 scores on the same test dataset used in the challenge evaluation. Database URL: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-viii/track-1/.
{"title":"Optimized biomedical entity relation extraction method with data augmentation and classification using GPT-4 and Gemini.","authors":"Cong-Phuoc Phan, Ben Phan, Jung-Hsien Chiang","doi":"10.1093/database/baae104","DOIUrl":"10.1093/database/baae104","url":null,"abstract":"<p><p>Despite numerous research efforts by teams participating in the BioCreative VIII Track 01 employing various techniques to achieve the high accuracy of biomedical relation tasks, the overall performance in this area still has substantial room for improvement. Large language models bring a new opportunity to improve the performance of existing techniques in natural language processing tasks. This paper presents our improved method for relation extraction, which involves integrating two renowned large language models: Gemini and GPT-4. Our new approach utilizes GPT-4 to generate augmented data for training, followed by an ensemble learning technique to combine the outputs of diverse models to create a more precise prediction. We then employ a method using Gemini responses as input to fine-tune the BioNLP-PubMed-Bert classification model, which leads to improved performance as measured by precision, recall, and F1 scores on the same test dataset used in the challenge evaluation. Database URL: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-viii/track-1/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11463225/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142388766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
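The relation extraction abstract above mentions an ensemble learning technique that combines the outputs of diverse models, but does not say which combination rule is used. As a generic illustration only, and not the authors' method, one common rule is plurality voting over per-example labels. The function name `majority_vote`, the tie-breaking rule, and the label strings below are assumptions made for this sketch.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> list[str]:
    """Combine label predictions from several models by plurality vote.

    predictions[m][i] is the label model m assigns to example i.
    Ties are broken in favour of the earliest model in the list
    (an arbitrary choice made for this sketch).
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_examples = len(predictions[0])
    if any(len(p) != n_examples for p in predictions):
        raise ValueError("all models must label the same examples")

    combined = []
    for labels in zip(*predictions):  # labels for one example, one per model
        counts = Counter(labels)
        top = max(counts.values())
        # First label, in model order, that reaches the top count.
        winner = next(label for label in labels if counts[label] == top)
        combined.append(winner)
    return combined


if __name__ == "__main__":
    # Hypothetical outputs of three classifiers on two candidate relations.
    model_a = ["related", "unrelated"]
    model_b = ["related", "related"]
    model_c = ["unrelated", "related"]
    print(majority_vote([model_a, model_b, model_c]))
```

Weighted voting or averaging of class probabilities are equally plausible alternatives; the abstract does not distinguish between them.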