Database: The Journal of Biological Databases and Curation: Latest Articles, Page 7

MANUDB: database and application to retrieve and visualize mammalian NUMTs.

IF 3.4 Zone 4 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Database: The Journal of Biological Databases and Curation

Pub Date : 2025-02-22 DOI: 10.1093/database/baaf009

Bálint Biró, Zoltán Gál, Zsófia Nagy, Juan Francisco Garcia, Tsend-Ayush Batbold, Orsolya Ivett Hoffmann

There is an ongoing genetic flow from the mitochondrial genome to the nuclear genome. The mitochondrial sequences that have integrated into the nuclear genome have been shown to be drivers of evolutionary processes and cancerous transformations. In addition to their fundamental biological importance, these sequences have significant consequences for genome assembly and for phylogenetic and forensic analyses. Previously, our research group developed a computational pipeline that provides a uniform way of identifying these sequences in mammalian genomes. In this paper, we publish MANUDB, the MAmmalian NUclear mitochondrial sequences DataBase, which makes the results of our pipeline publicly accessible. With MANUDB, one can retrieve and visualize mitochondrial genome fragments that have been integrated into the nuclear genome of mammalian species. Database URL: manudb.streamlit.app.

Citations: 0

PotatoBSLnc: a curated repository of potato long noncoding RNAs in response to biotic stress.

IF 3.4 Zone 4 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Database: The Journal of Biological Databases and Curation

Pub Date : 2025-02-22 DOI: 10.1093/database/baaf015

Pingping Huang, Weilin Cao, Zhaojun Li, Qingshuai Chen, Guangchao Wang, Bailing Zhou, Jihua Wang

Biotic stress significantly affects potato (Solanum tuberosum L.) production worldwide. Long noncoding RNAs (lncRNAs) play key roles in the plant response to environmental stressors. However, their roles in potato resistance to pathogens, insects, and other biotic stresses are still unclear. PotatoBSLnc is a database for the study of potato lncRNAs in response to major biotic stresses. Here, we collected 364 RNA sequencing (RNA-seq) datasets derived from 12 kinds of biotic stress in 26 cultivars and wild potatoes. PotatoBSLnc currently contains 18 636 lncRNAs and 44 263 mRNAs. In addition, to select functional lncRNAs and mRNAs under different stresses, we conducted differential expression analyses, together with Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses of the cis/trans-targets of differentially expressed lncRNAs (DElncRNAs) and of the differentially expressed mRNAs (DEmRNAs). The database contains five modules: Home, Browse, Expression, Biotic stress, and Download. Among these, the "Browse" module can be used to search detailed information about the RNA-seq data (disease, cultivar, organ type, sample treatment, and others) and the exon number, length, location, and sequence of each lncRNA/mRNA. The "Expression" module can be used to search the transcripts per million or raw count values of lncRNAs/mRNAs across the RNA-seq datasets. The "Biotic stress" module shows the results of differential expression analyses under each of the 12 biotic stresses, the cis/trans-targets of DElncRNAs, and the GO and KEGG analysis results for DEmRNAs and for the targets of DElncRNAs. The PotatoBSLnc platform provides researchers with detailed information on potato lncRNAs and mRNAs under biotic stress, which can speed up the molecular breeding of resistant varieties. Database URL: https://www.sdklab-biophysics-dzu.net/PotatoBSLnc.

Citations: 0

Integrating AI-powered text mining from PubTator into the manual curation workflow at the Comparative Toxicogenomics Database.

IF 3.4 Zone 4 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Database: The Journal of Biological Databases and Curation

Pub Date : 2025-02-21 DOI: 10.1093/database/baaf013

Thomas C Wiegers, Allan Peter Davis, Jolene Wiegers, Daniela Sciaky, Fern Barkalow, Brent Wyatt, Melissa Strong, Roy McMorran, Sakib Abrar, Carolyn J Mattingly

The Comparative Toxicogenomics Database (CTD) is a manually curated knowledge- and discovery-base that seeks to advance understanding about the relationship between environmental exposures and human health. CTD's manual curation process extracts from the biomedical literature molecular relationships between chemicals/drugs, genes/proteins, phenotypes, diseases, anatomical terms, and species. These relationships are organized in a highly systematic way in order to make them not only informative but also scientifically computational, enabling inferential hypotheses to be formed to address gaps in understanding. Integral to CTD's functionality is the use of structured, hierarchical ontologies and controlled vocabularies to describe these molecular relationships. Normalizing text (i.e. translating raw text from the literature into these controlled vocabularies) can be a time-consuming process for biocurators. To facilitate the normalization process and improve the efficiency with which our scientists curate the literature, CTD evaluated and integrated into the curation process PubTator 3.0, a state-of-the-art, AI-powered resource which extracts and normalizes from the literature many of the key biomedical concepts CTD curates. Here, we describe CTD's long-standing history with Natural Language Processing (NLP), how this history helped form our objectives for NLP integration, the evaluation of PubTator against our objectives, and the integration of PubTator into CTD's curation workflow. Database URL: https://ctdbase.org.

Citations: 0

LICEDB: light industrial core enzyme database for industrial applications and AI enzyme design.

IF 3.4 Zone 4 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Database: The Journal of Biological Databases and Curation

Pub Date : 2025-02-19 DOI: 10.1093/database/baaf001

Lei Gong, Fufeng Liu, Chuanxi Zhang, Yongfan Ming, Yulan Mou, ZhaoTing Yuan, Haiming Jiang, Bei Gao, Fuping Lu, Lujia Zhang

Enzymes, serving as eco-friendly catalysts, are progressively supplanting traditional chemical catalysts in light industry sectors such as feed, papermaking, textiles, detergents, leather, and sugar production. Despite this advancement, the variability in the performance of natural enzymes and the fragmentation and diversity of existing data formats pose significant challenges to researchers. Furthermore, AI-driven enzyme design is limited by the quality and quantity of available data. To address these issues, we introduce the light industrial core enzyme database (LICEDB), the first database dedicated exclusively to managing and standardizing enzymes for light industry applications. LICEDB, with its integrated modules for data retrieval, similarity analysis, and structural analysis, will enhance the efficient industrial application of enzymes and strengthen AI-driven predictive research, thereby advancing data sharing and utilization in the field of enzyme innovation. Database URL: http://lujialab.org.cn/on-line-databases/.

Citations: 0

Correction to: CardioHotspots: a database of mutational hotspots for cardiac disorders.

IF 3.4 Zone 4 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Database: The Journal of Biological Databases and Curation

Pub Date : 2025-02-17 DOI: 10.1093/database/baaf014

Citations: 0

Assessing the performance of generative artificial intelligence in retrieving information against manually curated genetic and genomic data.

IF 3.4 Zone 4 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Database: The Journal of Biological Databases and Curation

Pub Date : 2025-02-17 DOI: 10.1093/database/baaf011

Elly Poretsky, Victoria C Blake, Carson M Andorf, Taner Z Sen

Curated resources at centralized repositories provide high-value service to users by enhancing data veracity. Curation, however, comes with a cost, as it requires dedicated time and effort from personnel with deep domain knowledge. In this paper, we investigate the performance of large language models (LLMs), specifically generative pre-trained transformer (GPT)-3.5 and GPT-4, in extracting and presenting data against a human curator. In order to accomplish this task, we used a small set of journal articles on wheat and barley genetics, focusing on traits, such as salinity tolerance and disease resistance, which are becoming more important. The 36 papers were then curated by a professional curator for the GrainGenes database (https://wheat.pw.usda.gov). In parallel, we developed a GPT-based retrieval-augmented generation question-answering system and compared how GPT performed in answering questions about traits and quantitative trait loci (QTLs). Our findings show that on average GPT-4 correctly categorized manuscripts 97% of the time, correctly extracted 80% of traits, and correctly extracted 61% of marker-trait associations. Furthermore, we assessed the ability of a GPT-based DataFrame agent to filter and summarize curated wheat genetics data, showing the potential of human and computational curators working side-by-side. In one case study, our findings show that GPT-4 was able to retrieve up to 91% of disease-related, human-curated QTLs across the whole genome, and up to 96% across a specific genomic region through prompt engineering. Also, we observed that across most tasks, GPT-4 consistently outperformed GPT-3.5 while generating fewer hallucinations, suggesting that improvements in LLMs will make generative artificial intelligence a much more accurate companion for curators in extracting information from scientific literature. Despite their limitations, LLMs demonstrated a potential to extract and present information to curators and users of biological databases, as long as users are aware of potential inaccuracies and the possibility of incomplete information extraction.

Citations: 0
