
Database: The Journal of Biological Databases and Curation: latest articles

GdbMTB: a manually curated genomic database of magnetotactic bacteria.
IF 3.6, Zone 4 (Biology), Q1 Mathematical & Computational Biology. Pub Date: 2026-01-15. DOI: 10.1093/database/baaf090
Runjia Ji, Yongxin Pan, Wei Lin

Magnetotactic bacteria (MTB) are a unique group of microorganisms capable of navigating along geomagnetic field lines through the biomineralization of intracellular magnetic nanocrystals called magnetosomes. While genomic analyses have substantially advanced our understanding of these predominantly uncultured microorganisms, MTB genomic data remain scattered across multiple databases with inconsistent quality profiles and incomplete metadata, limiting comprehensive research efforts. To address these challenges, we developed the Genomic Database of Magnetotactic Bacteria (GdbMTB), the first comprehensive, curated genomic resource dedicated to MTB. The current version of GdbMTB integrates 365 publicly available MTB genomes and their associated metadata. Through a standardized bioinformatics workflow, it provides detailed genome quality assessments, taxonomic classifications, and annotations of magnetosome biomineralization genes, ensuring reliable data for downstream analyses. The curated metadata, encompassing environmental context and publication details, offers crucial research background, enabling users to trace the provenance of each genome. Additionally, GdbMTB offers a suite of bioinformatics tools and an analysis pipeline to facilitate advanced MTB studies. GdbMTB enhances accessibility to MTB genomic data, thereby promoting interdisciplinary research in microbiology, geomicrobiology, and biomineralization studies. Database URL: https://www.gdbmtb.cn/.

Citations: 0
ePerturbDB: enhancer's experimental perturbation database.
IF 3.6, Zone 4 (Biology), Q1 Mathematical & Computational Biology. Pub Date: 2026-01-15. DOI: 10.1093/database/baaf084
Samiksha Maurya, Jaidev Sharma, Amit Mandoli, Vibhor Kumar

Enhancers act as cis-regulatory elements, controlling gene expression according to developmental stage, external signalling, and cell state. Recent studies have shown that perturbing enhancer activity can affect gene expression and cell properties. However, perturbing an enhancer does not always have a substantial effect on gene expression or cell properties. Hence, there is a need to identify enhancers that can be effectively targeted for therapeutics and for understanding regulation, which in turn calls for a comprehensive resource on the effects of enhancer knockdown. Here, we introduce ePerturbDB, a database for searching the effects of 83 743 experimental perturbations of enhancers. ePerturbDB allows users to compare their genomic loci against the list of perturbed enhancers to assess their potential effect. It also provides enriched genes and ontology terms for query enhancer locations that overlap entries in a list of known experimentally perturbed enhancers. Thus, the resources and tools in ePerturbDB can help users build hypotheses and design experiments to find effective enhancer-based therapeutics and to draw inferences about the regulation of cell states. Database URL: http://reggen.iiitd.edu.in:1207/ePerturbDB-html/.
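
The comparison of user loci against perturbed enhancers described in this abstract is, at its core, a genomic interval overlap test. The following is a minimal sketch of that idea, not ePerturbDB's actual implementation: the `Interval` type, the function names, and all coordinates are hypothetical, and intervals are assumed to be 0-based and half-open.

```python
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    """A genomic interval; 0-based start, exclusive end (an assumption of this sketch)."""
    chrom: str
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """Two half-open intervals overlap if they share a chromosome and at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def find_overlaps(queries: List[Interval],
                  perturbed: List[Interval]) -> Dict[Interval, List[Interval]]:
    """Map each query locus to the perturbed enhancers it overlaps (naive O(n*m) scan)."""
    return {q: [p for p in perturbed if overlaps(q, p)] for q in queries}


# Hypothetical coordinates, for illustration only.
perturbed = [Interval("chr1", 1000, 1500), Interval("chr2", 200, 400)]
queries = [Interval("chr1", 1400, 1600), Interval("chr1", 1500, 1700)]
result = find_overlaps(queries, perturbed)
```

The second query starts exactly where the first enhancer ends, so under the half-open convention it is not reported as a hit. For tens of thousands of enhancers, a sorted list or an interval tree would be a more sensible choice than the linear scan shown here.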

Citations: 0
SynVectorDB: embedding-based retrieval system for synthetic biology parts.
IF 3.6, Zone 4 (Biology), Q1 Mathematical & Computational Biology. Pub Date: 2026-01-15. DOI: 10.1093/database/baaf088
Hao Li, Jiani Hu, Jie Song, Wei Zhou

Synthetic biology part discovery faces significant challenges due to inconsistent data organization and limited semantic search capabilities across existing repositories. We developed SynVectorDB, an embedding-based retrieval system that addresses these limitations through methodological innovations in data integration and AI-driven semantic search. Our approach integrates 19 850 biological parts from multiple sources (Addgene, iGEM Registry, laboratory collections), implementing systematic curation protocols that resulted in 7656 parts achieving verified status through literature-based validation and reliability assessment. We introduce a novel three-level hierarchical classification system organizing parts into functionally coherent categories (DNA Elements, RNA Elements, Coding Sequences, and Application Constructs) with detailed subcategorization. The core technical contribution employs BGE-M3 multilingual embeddings within a scalable vector database architecture to enable semantic similarity matching that significantly outperforms keyword-based retrieval methods. Standardized curation workflows enhance data comparability and search accuracy across heterogeneous sources. The dual deployment architecture ensures high performance through cloud services while maintaining open-source accessibility and deployment flexibility. The system maintains SBOL3 compatibility while providing innovative solutions for biological part organization and retrieval. Database URL: SynVectorDB is available in multiple deployment modes: web interface (https://svdb.sjtu.bio), local installation and source code (https://github.com/AilurusBio/synbio-parts-db), and MCP server integration for AI assistants (https://www.npmjs.com/package/synvectordb).
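
Embedding-based retrieval of the kind this abstract describes typically ranks items by the cosine similarity between a query vector and stored part vectors. The sketch below illustrates only that ranking step under stated assumptions: the three-dimensional vectors and part names are invented for illustration, no BGE-M3 model or vector database is invoked, and `top_k` is a hypothetical helper rather than part of SynVectorDB.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec: Sequence[float],
          index: Dict[str, Sequence[float]],
          k: int = 2) -> List[str]:
    """Return the names of the k stored vectors most similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy embeddings (made up); a real system would use model-generated vectors.
toy_index = {
    "promoter_A": [0.9, 0.1, 0.0],
    "terminator_B": [0.0, 0.2, 0.9],
    "promoter_C": [0.8, 0.3, 0.1],
}
ranked = top_k([1.0, 0.0, 0.0], toy_index, k=2)
```

Unlike keyword matching, this ranking depends only on vector geometry, so two parts described with different words can still score as similar if their embeddings are close.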

Citations: 0
H-SPAR DB: human spaceflight platform for analysis and research-an integrative omics database for space health.
IF 3.6, Zone 4 (Biology), Q1 Mathematical & Computational Biology. Pub Date: 2026-01-15. DOI: 10.1093/database/baaf083
Marios Tomazou, Marilena M Bourdakou, Eleni Nicolaidou, Grigoris Georgiou, Kyriaki Savva, Efi Athieniti, Styliana Menelaou, Sotiroula Afxenti, George M Spyrou

H-SPAR DB is a comprehensive database designed to support space health research by providing a unified platform for data integration, analysis, and interpretation. The database simplifies the complex workflows associated with spaceflight-related biology studies by combining curated molecular lists, transcriptomic datasets from NASA's GeneLab, and user-uploaded data into a streamlined, user-friendly interface. H-SPAR DB enables researchers to perform differential expression analysis, set operations, and association analyses while also generating integrative knowledge graphs around a space-related biological theme. The platform reduces the time required for data gathering and processing by offering a single platform for data exploration, analysis, and visualization. By integrating interactive visualizations and data tables, H-SPAR DB facilitates the interpretation of results, ultimately enhancing the efficiency of space biology research and fostering discoveries that address human health challenges in space. Researchers can access H-SPAR DB freely at https://bioinformatics.cing.ac.cy/H-SPARDB/ without login or other requirements.

Citations: 0
Correction to: GymnoTOA-db: a database and application to optimize functional annotation in gymnosperms.
IF 3.6, Zone 4 (Biology), Q1 Mathematical & Computational Biology. Pub Date: 2025-07-08. DOI: 10.1093/database/baaf041
Citations: 0
CAS: enhancing implicit constrained data augmentation with semantic enrichment for biomedical relation extraction and beyond.
IF 3.4, Zone 4 (Biology), Q1 Mathematical & Computational Biology. Pub Date: 2025-07-03. DOI: 10.1093/database/baaf025
Fang-Yi Su, Gia-Han Ngo, Ben Phan, Jung-Hsien Chiang

Biomedical relation extraction often involves datasets with implicit constraints, where structural, syntactic, or semantic rules must be strictly preserved to maintain data integrity. Traditional data augmentation techniques struggle in these scenarios, as they risk violating domain-specific constraints. To address these challenges, we propose CAS (Constrained Augmentation and Semantic-Quality), a novel framework designed for constrained datasets. CAS employs large language models to generate diverse data variations while adhering to predefined rules, and it integrates the SemQ Filter. This self-evaluation mechanism ensures the quality and consistency of augmented data by filtering out noisy or semantically incongruent samples. Although CAS is primarily designed for biomedical relation extraction, its versatile design extends its applicability to tasks with implicit constraints, such as code completion, mathematical reasoning, and information retrieval. Through extensive experiments across multiple domains, CAS demonstrates its ability to enhance model performance by maintaining structural fidelity and semantic accuracy in augmented data. These results highlight the potential of CAS not only in advancing biomedical NLP research but also in addressing data augmentation challenges in diverse constrained-task settings within natural language processing. Database URL: https://github.com/ngogiahan149/CAS.
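
The generate-then-filter pattern in this abstract can be illustrated with a small sketch. It is not the authors' method: the abstract describes the SemQ Filter as a self-evaluation mechanism, whereas the code below swaps in a plain token-overlap (Jaccard) score as a stand-in, and the entity-preservation rule, the threshold, the function names, and the example sentences are all hypothetical.

```python
from typing import List


def keeps_entities(candidate: str, entities: List[str]) -> bool:
    """Constraint check: every annotated entity mention must survive augmentation."""
    return all(entity in candidate for entity in entities)


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (a crude stand-in for a semantic quality score)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def filter_augmented(original: str, entities: List[str],
                     candidates: List[str], min_score: float = 0.3) -> List[str]:
    """Keep candidates that satisfy the constraint and stay close to the original."""
    return [c for c in candidates
            if keeps_entities(c, entities) and jaccard(original, c) >= min_score]


# Hypothetical example sentence and augmented variants.
original = "aspirin inhibits COX1 in platelets"
entities = ["aspirin", "COX1"]
candidates = [
    "aspirin strongly inhibits COX1 in human platelets",       # constraint kept, close
    "the drug inhibits COX1 in platelets",                     # drops an entity
    "aspirin COX1 unrelated words everywhere here today now",  # entities kept, drifted
]
kept = filter_augmented(original, entities, candidates)
```

Only the first candidate passes both checks. Separating the hard constraint (entities preserved) from the soft quality score keeps the two failure modes easy to inspect independently.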

Citations: 0
Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models.
IF 3.4, Zone 4 (Biology), Q1 Mathematical & Computational Biology. Pub Date: 2025-05-30. DOI: 10.1093/database/baaf027
Muhammad Nabeel Asim, Tayyaba Asif, Faiza Hassan, Andreas Dengel

Protein sequence analysis examines the order of amino acids within protein sequences to unlock a wealth of knowledge about biological processes and genetic disorders. It helps forecast disease susceptibility by finding unique protein signatures, or biomarkers, that are linked to particular disease states. Protein sequence analysis through wet-lab experiments is expensive, time-consuming, and error-prone. To facilitate large-scale proteomics sequence analysis, the biological community is striving to use AI to move from wet-lab experiments to computer-aided applications. However, proteomics and AI are two distinct fields, and developing AI-driven protein sequence analysis applications requires knowledge of both domains. Various review articles have been written to bridge the gap between the two fields, but they focus on a few individual tasks or specific applications rather than providing a comprehensive overview of the wide range of tasks and applications. To meet the need for a review that presents a holistic view, this manuscript makes several contributions. It bridges the gap between the proteomics and AI fields by presenting a comprehensive array of AI-driven applications for 63 distinct protein sequence analysis tasks. It equips AI researchers with the biological foundations of these 63 tasks. It supports the development of AI-driven protein sequence analysis applications by providing comprehensive details of 68 protein databases. It presents a rich data landscape encompassing 627 benchmark datasets across the 63 tasks. It highlights the use of 25 unique word embedding methods and 13 language models in AI-driven protein sequence analysis applications. It accelerates the development of AI-driven applications by compiling current state-of-the-art performance across the 63 tasks.

Citations: 0
Enhancing biomedical relation extraction through data-centric and preprocessing-robust ensemble learning approach.
IF 3.4, Zone 4 (Biology), Q1 Mathematical & Computational Biology. Pub Date: 2025-05-22. DOI: 10.1093/database/baae127
Wilailack Meesawad, Jen-Chieh Han, Chun-Yu Hsueh, Yu Zhang, Hsi-Chuan Hung, Richard Tzong-Han Tsai

The paper describes our biomedical relation extraction system, which was designed for the BioCreative VIII challenge Track 1 (the BioRED Track), which emphasizes relation extraction from biomedical literature. Our system employs an ensemble learning method, leveraging the PubTator API in conjunction with multiple pretrained bidirectional encoder representations from transformers (BERT) models. Various preprocessing inputs are incorporated, encompassing prompt questions, entity ID pairs, and co-occurrence contexts. To enhance model comprehension, special tokens and boundary tags are added. Specifically, we utilize PubMedBERT alongside the Max Rule ensemble learning mechanism to combine outputs from diverse classifiers. Our results surpass the established benchmark score, thereby providing a robust benchmark for evaluating performance in this task. Moreover, our study introduces and demonstrates the effectiveness of a data-centric approach, emphasizing the significance of prioritizing high-quality data instances in enhancing model performance and robustness.
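
The max rule mentioned in this abstract is a standard classifier-combination scheme: for each class, take the highest probability any classifier assigns to it, then predict the class with the largest such value. The sketch below shows that textbook form and may differ in detail from the authors' system; the relation labels and probabilities are hypothetical.

```python
from typing import Dict, List, Tuple


def max_rule(prob_rows: List[Dict[str, float]]) -> Tuple[str, Dict[str, float]]:
    """Combine per-classifier class probabilities with the max rule.

    prob_rows holds one dict per classifier, all sharing the same label keys.
    Returns the winning label and the per-label maxima.
    """
    labels = prob_rows[0].keys()
    combined = {label: max(row[label] for row in prob_rows) for label in labels}
    winner = max(combined, key=combined.get)
    return winner, combined


# Hypothetical outputs from three classifiers for one entity pair.
outputs = [
    {"Association": 0.55, "Positive_Correlation": 0.30, "None": 0.15},
    {"Association": 0.20, "Positive_Correlation": 0.70, "None": 0.10},
    {"Association": 0.45, "Positive_Correlation": 0.35, "None": 0.20},
]
label, combined = max_rule(outputs)
```

Here two of the three classifiers favour "Association", yet the max rule selects "Positive_Correlation" because one classifier is highly confident in it. That is the characteristic difference from majority voting: a single confident model can decide the outcome.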

Citations: 0
An exploratory study combining Virtual Reality and Semantic Web for life science research using Graph2VR.
IF 3.4 · Zone 4 (Biology) · Q1 Mathematical & Computational Biology · Pub Date: 2025-05-20 · DOI: 10.1093/database/baaf008
Alexander J Kellmann, Sander van den Hoek, Max Postema, W T Kars Maassen, Brenda S Hijmans, Marije A van der Geest, K Joeri van der Velde, Esther J van Enckevort, Morris A Swertz

We previously described Graph2VR, a prototype that enables researchers to use virtual reality (VR) to explore and navigate Linked Data graphs using SPARQL queries (see https://doi.org/10.1093/database/baae008). Here we evaluate Graph2VR in three realistic life science use cases. The first use case visualizes metadata from large-scale multi-center cohort studies across Europe and Canada via the EUCAN Connect catalogue. The second use case involves a set of genomic data from synthetic rare disease patients, which was processed through the Variant Interpretation Pipeline and then converted into Resource Description Framework (RDF) for visualization. The third use case involves enriching a graph with additional information, in this case the Dutch Anatomical Therapeutic Chemical code Ontology with the DrugID from DrugBank. Together, these examples showcase Graph2VR's potential for data exploration and enrichment, as well as some of its limitations. We conclude that the endless three-dimensional space provided by VR shows considerable potential for navigating very large knowledge graphs, and we provide recommendations for data preparation and VR tooling going forward. Database URL: https://doi.org/10.1093/database/baaf008.
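
The enrichment idea in the third use case can be sketched with a graph held as a set of subject-predicate-object triples: for every node that appears in a lookup table, a new triple linking it to an external identifier is added. Everything in this sketch is a hypothetical placeholder (the `enrich` function, the `ex:` predicates, and the node and identifier names); it does not reflect Graph2VR's code or real ATC or DrugBank records.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def enrich(graph: Set[Triple], atc_to_drugbank: Dict[str, str]) -> Set[Triple]:
    """Return a copy of the graph with a DrugBank link for each known ATC node."""
    enriched = set(graph)
    subjects = {s for s, _, _ in graph}
    for atc_node, drug_id in atc_to_drugbank.items():
        # Only link nodes that already exist in the graph.
        if atc_node in subjects:
            enriched.add((atc_node, "ex:hasDrugBankID", drug_id))
    return enriched


# Placeholder graph and mapping; the identifiers are not real records.
graph = {
    ("atc:CODE_A", "ex:label", "example substance A"),
    ("atc:CODE_B", "ex:label", "example substance B"),
}
mapping = {"atc:CODE_A": "drugbank:ID_1", "atc:CODE_Z": "drugbank:ID_9"}

enriched = enrich(graph, mapping)
print(len(enriched))  # prints 3
```

Only `atc:CODE_A` gains a link, because `atc:CODE_Z` is not a node in the original graph.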

Citations: 0