Pub Date: 2025-01-18; DOI: 10.1093/database/baaf037
Evgeniia M Maksiutenko, Igor V Bezdvornykh, Yury A Barbitoff, Yulia A Nasykhova, Andrey S Glotov
Pregnancy loss is an important reproductive health problem that affects many couples. Genetic factors play an important role in both spontaneous miscarriage and recurrent pregnancy loss, and the effect of genomic variants is recognized as one of the major causes of pregnancy loss in euploid foetuses. In this work, we extend our previous analysis of the genetic landscape of pregnancy loss and develop a Pregnancy Loss genetic Variant (PLoV) database to aggregate information about mutations that have been implicated in pregnancy loss. The database contains information about 534 genetic variants that have been observed in 421 cases across 47 studies, including foetus-only, parent-only, and trio-based studies. For each case, the database includes a detailed description of the phenotype, including ultrasound data (if provided in the original article). The genetic variants are scattered across all chromosomes in the human genome and affect a total of 292 unique genes. The PLoV database is publicly accessible. Database URL: https://plovdb.ott.ru/.
Title: PLoV: a comprehensive database of genetic variants leading to pregnancy loss. Journal: Database: The Journal of Biological Databases and Curation. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12462621/pdf/
Pub Date: 2025-01-18; DOI: 10.1093/database/baaf053
Gabriele Baniulyte, Sawyer M Hicks, Morgan A Sammons
The tumour suppressor gene TP53 encodes the DNA binding transcription factor p53 and is one of the most mutated genes in human cancer. Tumour suppressor activity requires binding of p53 to its DNA response elements and subsequent transcriptional activation of a diverse set of target genes. Despite decades of close study, the logic underlying p53 interactions with its numerous potential genomic binding sites and target genes is not yet fully understood. Here, we present a database of DNA and chromatin-based information focused on putative p53 binding sites in the human genome to allow users to generate and test new hypotheses related to p53 activity in the genome. Users can query genomic locations based on experimentally observed p53 binding, regulatory element activity, genetic variation, evolutionary conservation, chromatin modification state, and chromatin structure. We present multiple use cases demonstrating the utility of this database for generating novel biological hypotheses, such as chromatin-based determinants of p53 binding and potential cell type-specific p53 activity. All database information is also available as a precompiled SQLite database for use in local analysis or as a Shiny web application. Database URL: https://p53motifDB.its.albany.edu.
Title: p53motifDB: integration of genomic information and tumour suppressor p53 binding motifs. Journal: Database: The Journal of Biological Databases and Curation. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12462388/pdf/
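The p53motifDB abstract notes that the data are available as a precompiled SQLite database for local analysis. A minimal sketch of what such a local query might look like follows. The table name `motifs` and its columns are hypothetical stand-ins created in memory for illustration; the actual p53motifDB schema is not described in the abstract and may differ.

```python
import sqlite3

# Hypothetical schema for illustration only; the real p53motifDB tables
# and column names are not given in the abstract.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE motifs ("
    "motif_id INTEGER PRIMARY KEY, chrom TEXT, start INTEGER, "
    "stop INTEGER, bound INTEGER)"
)
conn.executemany(
    "INSERT INTO motifs (chrom, start, stop, bound) VALUES (?, ?, ?, ?)",
    [("chr1", 100, 120, 1), ("chr1", 500, 520, 0), ("chr2", 50, 70, 1)],
)


def bound_motifs_in_region(conn, chrom, region_start, region_end):
    """Return motifs flagged as bound that overlap the given interval."""
    query = (
        "SELECT chrom, start, stop FROM motifs "
        "WHERE chrom = ? AND bound = 1 AND start < ? AND stop > ? "
        "ORDER BY start"
    )
    # Parameter binding keeps user-supplied coordinates out of the SQL text.
    return conn.execute(query, (chrom, region_end, region_start)).fetchall()


hits = bound_motifs_in_region(conn, "chr1", 0, 1000)
print(hits)
```

With a downloaded copy of the database, `sqlite3.connect` would point at the file path instead of `":memory:"`, and the query would need to be adapted to the published schema.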
Pub Date: 2025-01-18; DOI: 10.1093/database/baaf078
Jeffrey Furlong, Stephanie Goya, Eric P Nawrocki, Vincent Calhoun, Eneida Hatcher, Linda Yankie, Alexander L Greninger
Accurate annotation of viral genomes is essential for reliable downstream analysis and public data sharing. While the National Center for Biotechnology Information's (NCBI's) Viral Annotation DefineR (VADR) pipeline provides standardized annotation and quality control, it only supports six viral groups to date. Here, we developed and validated 12 new reference sequence-based VADR models targeting key human respiratory viruses: measles virus, mumps virus, rubella virus, human metapneumovirus, human parainfluenza virus types 1-4, and seasonal coronaviruses (229E, NL63, OC43, and HKU1). Model construction was guided by a comprehensive analysis of intra-species genomic and phylogenetic diversity, enabling the development of genotype-specific models associated with reference genomes that defined expected genome structure and annotation. Models were trained on 5327 publicly available complete viral genomes and tested on 372 viral genomes not yet submitted to GenBank. VADR passed 96.3% of publicly available viral genomes and 98.1% of viral genomes not in the training set, correctly identifying overlapping ORFs, mature peptides, and transcriptional slippage as well as genome misassemblies. VADR detected novel viral biology, including the first reported HCoV-OC43 NS2 knockout in a human infection and novel G and SH coding sequence lengths in human metapneumovirus. These VADR models are publicly available and are used by NCBI curators as part of the GenBank submission pipeline, supporting high-quality, scalable viral genome annotation for research and public health.
Title: Automated annotation and validation of human respiratory virus sequences using VADR. Journal: Database: The Journal of Biological Databases and Curation. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12648392/pdf/
Pub Date: 2025-01-18; DOI: 10.1093/database/baaf079
Pan Zhang, Tianxiang Ouyang, Xiaowen Hu, Jie Huang, Biao Xiao, Zhijian Huang, Xingyang Shi, Xinyi Wu, Linying Chen, Yongkang Wu, Hanyue Wang, Ying Zhang, Guangdi Li, Hui Liu, Lei Deng
Over the past few decades, coronavirus outbreaks have been reported globally. To date, seven human coronaviruses have been identified, among which only SARS-CoV-2 has been extensively studied, resulting in the development of several approved antiviral drugs. To effectively combat both current and emerging coronaviruses, there is an urgent need for a comprehensive database that consolidates information on all known human coronaviruses and their potential antiviral compounds. In response, we present HCoVDB, a comprehensive database that integrates genomic data, viral proteins, and antiviral agents with demonstrated in vitro or in vivo activity against the seven human coronaviruses. Compared to existing coronavirus databases, HCoVDB offers three distinctive features: (i) a curated collection and annotation of over 4 million genomic sequences from all seven human coronaviruses, including key amino acid substitutions that influence viral fitness, drug resistance, and immune evasion; (ii) a protein-drug docking platform for predicting the binding interactions of antiviral agents with demonstrated activity; and (iii) an extensive compilation of antiviral agents, along with their chemical properties and antiviral efficacy profiles (IC50, EC50, or CC50) as reported in the literature. Overall, HCoVDB provides a valuable resource for tracking the evolutionary dynamics of coronaviruses and accelerating the development of broad-spectrum antiviral agents against coronavirus infections. Database URL: http://hcovdb.denglab.org/.
Title: HCoVDB: a comprehensive database encompassing viral genomes, drug targets, and therapeutics of human coronaviruses. Journal: Database: The Journal of Biological Databases and Curation. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12648390/pdf/
Protein sequence alignments are evolutionary models and serve as starting points for the recognition of additional members of a homologous family and the design of experiments. However, the accuracy of sequence alignments is obscured at the superfamily level due to distant relationships. Where structures of proteins are available, distantly related proteins can be aligned, guided by structural features. The Protein Alignment Organized as Structural Superfamilies (PASS2) database offers such structure-based sequence alignments for protein domains classified within superfamilies, as per the Structural Classification of Proteins extended (SCOPe) framework. The present update of PASS2 (PASS2.8) corresponds to the latest SCOPe release (version 2.08). This release comprises data for 26 690 protein domains exhibiting less than 40% sequence identity, organized into 2058 superfamilies. Several features derived from these alignments, including conserved secondary structural motifs, hidden Markov models (HMMs), conserved residues, and interactions across superfamilies, are also provided. For superfamilies containing divergent members, a k-means clustering algorithm has been employed to identify outliers and partition domains into split superfamilies. Novel features in this update include topological diagrams of the domains, potential interactors for each domain, and an updated methodology for identifying conserved interactions across superfamilies. This version of the database can be reached from http://caps.ncbs.res.in/pass2.
Title: PASS2: update of database of structure-based sequence alignments. Authors: Revathy Menon, Soumya Nayak, Rama Rajesh, Ramanathan Sowdhamini. Pub Date: 2025-01-18; DOI: 10.1093/database/baaf072. Journal: Database: The Journal of Biological Databases and Curation. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12612674/pdf/
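The PASS2 abstract mentions k-means clustering as the means of partitioning divergent superfamilies. The sketch below shows plain Lloyd's k-means on toy two-dimensional points, purely to illustrate the technique. The feature vectors, the choice of k, and the seeding with the first k points are all assumptions made here; the abstract does not say which features PASS2 clusters on or how outliers are flagged.

```python
import math


def kmeans(points, k, max_iterations=100):
    """Plain Lloyd's k-means.

    Seeds the centroids with the first k points so the result is
    deterministic (an illustrative choice, not a recommendation).
    Returns (labels, centroids).
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Toy, made-up coordinates standing in for per-domain feature vectors.
toy_domains = [
    (0.0, 0.0), (5.0, 5.0), (0.2, 0.1),
    (5.1, 4.9), (0.1, 0.3), (4.8, 5.2),
]
labels, centroids = kmeans(toy_domains, k=2)
print(labels)
```

In a setting like the one the abstract describes, each resulting cluster would correspond to one candidate split of a superfamily, though the criteria PASS2 applies on top of the clustering are not stated in the abstract.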
Pub Date: 2025-01-18; DOI: 10.1093/database/baaf057
Rubén González-Miguéns, Àlex Gàlvez-Morante, Margarita Skamnelou, Meritxell Antó, Elena Casacuberta, Daniel J Richter, Enrique Lara, Daniel Vaulot, Javier Del Campo, Iñaki Ruiz-Trillo
Metabarcoding has emerged as a robust method for assessing biodiversity patterns by retrieving environmental DNA directly from ecosystems. While the 18S rRNA gene is the primary genetic marker used for broad eukaryotic metabarcoding, it has limitations in resolving lower taxonomic levels. A potential alternative is the mitochondrial cytochrome oxidase subunit I (COI) gene because it offers resolution at the species level. However, the COI gene lacks a comprehensive, curated, taxonomically informed database that includes protists. To address this gap, we introduce eKOI, a novel, curated COI gene database designed to enhance the taxonomic annotation of protists in COI-based metabarcoding. eKOI integrates data from GenBank and mitochondrial genomes, followed by extensive manual curation to eliminate redundancies and contaminants, recovering 15 947 sequences within 80 eukaryotic phyla. We validated the use of eKOI by reannotating several COI metabarcoding datasets, revealing previously unidentified protist biodiversity and demonstrating the database's utility for community-level analyses.
Title: A novel taxonomic database for eukaryotic mitochondrial cytochrome oxidase subunit I gene (eKOI), with a focus on protists diversity. Journal: Database: The Journal of Biological Databases and Curation. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12462617/pdf/
Pub Date: 2025-01-18; DOI: 10.1093/database/baaf048
Peter Selby, Rafael Abbeloos, Anne-Francoise Adam-Blondon, Francisco J Agosto-Pérez, Michael Alaux, Isabelle Alic, Khaled Al-Shamaa, Johan Steven Aparicio, Jan Erik Backlund, Aldrin Batac, Sebastian Beier, Gabriel Besombes, Alice Boizet, Matthijs Brouwer, Terry Casstevens, Arnaud Charleroy, Keo Corak, Chaney Courtney, Mariano Crimi, Gouripriya Davuluri, Kauê de Sousa, Jeremy Destin, Stijn Dhondt, Ajay Dhungana, Bert Droesbeke, Manuel Feser, Mirella Flores-Gonzalez, Valentin Guignon, Corina Habito, Asis Hallab, Jenna Hershberger, Puthick Hok, Amanda M Hulse-Kemp, Lynn Carol Johnson, Sook Jung, Paul Kersey, Andrzej Kilian, Patrick König, Suman Kumar, Josh Lamos-Sweeney, Laszlo Lang, Matthias Lange, Marie-Angélique Laporte, Taein Lee, Erwan Le Floch, Francisco López, Brandon Madriz, Dorrie Main, Marco Marsella, Maud Marty, Célia Michotey, Zachary Miller, Iain Milne, Lukas A Mueller, Moses Nderitu, Pascal Neveu, Nick Palladino, Tim Parsons, Cyril Pommier, Jean-François Rami, Sebastian Raubach, Trevor Rife, Kelly Robbins, Mathieu Rouard, Joseph Ruff, Guilhem Sempéré, Romil Mayank Shah, Paul Shaw, Becky Smith, Nahuel Soldevilla, Anne Tireau, Clarysabel Tovar, Grzegorz Uszynski, Vivian Bass Vega, Stephan Weise, Shawn C Yarnes, The BrAPI Consortium
Population growth and the impacts of climate change are placing increasing pressure on global agriculture and breeding programmes. Recent advancements in phenotyping techniques, genotyping technologies, and predictive modelling are accelerating genetic gains in breeding programmes, helping researchers and breeders develop improved crops more efficiently. However, these advancements have also led to an overwhelming torrent of fragmented data, creating significant challenges in data integration and management. To address this issue, the Breeding Application Programming Interface (BrAPI) project was established as a standardized data model for breeding data. BrAPI is an international, community-driven effort that facilitates interoperability among databases and tools, improving the sharing and interpretation of breeding-related data. This open-source standard is software-agnostic and can be used by anyone interested in breeding, phenotyping, germplasm, genotyping, and agronomy data management. This manuscript provides an overview of the BrAPI project, highlighting the significant progress made in the development of the data standard and the expansion of its community. It also presents a showcase of the wide variety of BrAPI-compatible tools that have been built to enhance breeding and research activities, demonstrating how the project is advancing agricultural innovation and data management practices.
Title: BrAPI v2: real-world applications for data integration and collaboration in the breeding and genetics community. Journal: Database: The Journal of Biological Databases and Curation. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12462623/pdf/
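Because BrAPI is described as a standardized interface shared across databases and tools, a client can be written once against the common response shape. The sketch below walks a paginated list response without making any network call. It assumes the response envelope used by BrAPI v2 list endpoints (`metadata.pagination` with `currentPage` and `totalPages`, and records under `result.data`); these field names should be checked against the specification. `fetch_page` and `fake_fetch` are hypothetical helpers introduced here, and the germplasm identifiers are made up.

```python
def iter_brapi_records(fetch_page, page_size=2):
    """Yield every record from a paginated BrAPI-style list response.

    fetch_page(page, page_size) is supplied by the caller and must return
    the decoded JSON body for one page (hypothetical helper, e.g. a thin
    wrapper around an HTTP GET against a BrAPI server).
    """
    page = 0
    while True:
        body = fetch_page(page, page_size)
        yield from body["result"]["data"]
        pagination = body["metadata"]["pagination"]
        # Pages are zero-indexed, so stop once the last page is consumed.
        if pagination["currentPage"] + 1 >= pagination["totalPages"]:
            break
        page += 1


# Made-up records standing in for a server's germplasm list.
_FAKE_RECORDS = [
    {"germplasmDbId": "g1"},
    {"germplasmDbId": "g2"},
    {"germplasmDbId": "g3"},
]


def fake_fetch(page, page_size):
    """Stand-in for an HTTP call; slices the made-up records into pages."""
    start = page * page_size
    total_pages = -(-len(_FAKE_RECORDS) // page_size)  # ceiling division
    return {
        "metadata": {
            "pagination": {
                "currentPage": page,
                "pageSize": page_size,
                "totalCount": len(_FAKE_RECORDS),
                "totalPages": total_pages,
            }
        },
        "result": {"data": _FAKE_RECORDS[start:start + page_size]},
    }


ids = [record["germplasmDbId"] for record in iter_brapi_records(fake_fetch)]
print(ids)
```

Passing the fetch function in as a parameter keeps the pagination logic independent of any particular server or HTTP library, which is the kind of software-agnostic use the abstract describes.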
Pub Date: 2025-01-18; DOI: 10.1093/database/baaf044
Harald Hutter, Mehrdad Moosavi, Nelly Mafi
GExplore is an online tool to assist with large-scale data mining of selected datasets related to gene and protein function in Caenorhabditis elegans. Here, we describe the current version, GExplore 1.5, which contains new datasets and display options as well as a completely redesigned web interface. GExplore now consists of six databases. The gene database contains protein domain information, general expression, and phenotype data as well as interacting genes, gene ontology annotations, and disease associations. The mutation database contains a curated list of more than 200 000 mutations affecting the protein sequences of all protein-coding genes. The protein database contains proteome data from 19 different nematode species, four genetic model organisms, and the human proteome for comparison. Three genome-scale RNAseq expression databases contain expression profiles of different developmental stages from embryo to adult, tissue-specific expression profiles at the L2 stage, and expression profiles of the major tissues in the developing embryo at five different time points from gastrulation to the beginning of terminal differentiation. The web-based user interface has been completely redeveloped for the current version. The search interfaces allow users to explore the content of the individual databases in detail. The interactive display pages enable the user to fine-tune the results, display additional data, and download the results. GExplore is a tool to quickly obtain an overview of biological and biochemical functions of large groups of genes or identify genes with a certain combination of features for further experimental analysis. Database URL: https://genome.science.sfu.ca/gexplore.
Title: GExplore 1.5: a comprehensive Caenorhabditis elegans database for the analysis of gene function with a new user-friendly web interface. Journal: Database: The Journal of Biological Databases and Curation. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12462625/pdf/
Pub Date: 2025-01-18 DOI: 10.1093/database/baaf052
Paula Iglesias-Rivas, Roberto Del Amparo, Javier A Cabaleiro, Miguel Arenas
Substitution models of protein evolution describe the patterns of amino acid substitutions over evolutionary time and are fundamental for probabilistic methods of phylogenetic inference. At the protein level, a variety of substitution models are available, but only empirical substitution models are well established in phylogenetics due to their mathematical simplicity. Despite their importance, a database compiling the large number of currently available empirical substitution models of protein evolution is lacking, although such a resource could facilitate access, assessment, and subsequent implementation of these models into phylogenetic frameworks. Moreover, few formal comparisons exist between the current set of empirical substitution models. We present EModelDB, a database of empirical substitution models of protein evolution required for probabilistic protein phylogenetics that includes the corresponding exchangeability matrices, model classification, and model-specific biological information. The database is integrated into a graphical user interface, written in Python and SQL, that facilitates its usability. We also compared common empirical substitution models in terms of the distance between their relative rates of amino acid substitution and amino acid frequencies at equilibrium. We found that substitution models derived from proteins related in nature tend to cluster together, reflecting similar evolutionary patterns. In addition, we evaluated the empirical substitution models in terms of the folding stability of the derived modeled proteins and found that they generally produce less stable proteins compared to real proteins, suggesting that substitution models with additional evolutionary constraints may be preferable for studying protein evolution while accounting for folding stability. Database URL: https://github.com/Paula-Iglesias-Rivas/EModelDB.
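The model comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not name the distance metric, so the Euclidean distance used here is an assumption, as are the function names, the dictionary layout, and the toy three-state models (real protein models use 20 amino acids). This is not code from EModelDB.

```python
import math


def normalize(values):
    """Scale values so they sum to 1, making models comparable."""
    total = sum(values)
    if total <= 0:
        raise ValueError("values must sum to a positive number")
    return [v / total for v in values]


def upper_triangle(matrix):
    """Flatten the off-diagonal upper triangle of a symmetric matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def model_distance(model_a, model_b):
    """Return (rate distance, frequency distance) between two models.

    Each model is a dict with hypothetical keys:
      "exchangeabilities": symmetric matrix of relative rates
      "frequencies": equilibrium frequencies
    """
    rates_a = normalize(upper_triangle(model_a["exchangeabilities"]))
    rates_b = normalize(upper_triangle(model_b["exchangeabilities"]))
    freqs_a = normalize(model_a["frequencies"])
    freqs_b = normalize(model_b["frequencies"])
    return euclidean(rates_a, rates_b), euclidean(freqs_a, freqs_b)


# Toy, made-up models over a three-letter alphabet (illustration only).
toy_a = {
    "exchangeabilities": [[0, 1, 2], [1, 0, 3], [2, 3, 0]],
    "frequencies": [0.5, 0.3, 0.2],
}
toy_b = {
    "exchangeabilities": [[0, 2, 4], [2, 0, 6], [4, 6, 0]],
    "frequencies": [0.2, 0.3, 0.5],
}

rate_dist, freq_dist = model_distance(toy_a, toy_b)
print(rate_dist, freq_dist)
```

Normalizing before comparing means that two models whose exchangeabilities differ only by a constant scale factor have a rate distance of zero, which matches the idea of comparing relative rather than absolute rates.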
Empirical substitution models of protein evolution: database, relationships, and modeling considerations. Paula Iglesias-Rivas, Roberto Del Amparo, Javier A Cabaleiro, Miguel Arenas. Database: The Journal of Biological Databases and Curation, 2025. Publication date: 2025-01-18. DOI: 10.1093/database/baaf052. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12462380/pdf/
Recent advancements highlight the importance of large-scale causal inference in elucidating disease mechanisms and guiding public health strategies. Mendelian randomization (MR) has become a cornerstone method for identifying causal relationships by leveraging genetic variants as instrumental variables. However, existing tools lack flexibility for multivariable analyses and fail to integrate diverse datasets effectively. To address these challenges, we introduce MRdb, a comprehensive database designed for conducting both univariable and multivariable MR analyses. MRdb encompasses 12 distinct categories of exposure data, including but not limited to 19 126 expression quantitative trait loci genes, 4907 plasma proteins, and 1400 plasma metabolites. Additionally, it integrates 48 507 disease outcomes sourced from FinnGen R10 and the IEU Open GWAS Project. MRdb offers robust data preprocessing features, including handling missing statistics, harmonizing datasets, and selecting instrumental variables to ensure high-quality analyses. Collectively, MRdb bridges the gaps in existing tools by integrating diverse datasets with user-friendly functionalities, empowering researchers to explore complex causal mechanisms.
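For readers unfamiliar with how genetic variants serve as instrumental variables, the following is a minimal sketch of the standard fixed-effect inverse-variance weighted (IVW) estimator used in univariable MR. The abstract does not state which estimators MRdb implements, so this is a generic illustration rather than MRdb's method; the function name and the summary statistics are made up.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from per-variant summary statistics.

    beta_exposure: variant effects on the exposure
    beta_outcome:  variant effects on the outcome
    se_outcome:    standard errors of the outcome effects
    Returns (estimate, standard_error).
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    if not beta_exposure:
        raise ValueError("at least one instrument is required")

    numerator = 0.0
    denominator = 0.0
    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome):
        weight = 1.0 / (se ** 2)        # inverse variance of the outcome effect
        numerator += bx * by * weight
        denominator += bx * bx * weight

    if denominator == 0:
        raise ValueError("instruments have no effect on the exposure")
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Made-up summary statistics for two hypothetical instruments.
bx = [0.1, 0.2]
by = [0.05, 0.1]
se = [0.01, 0.02]

estimate, std_err = ivw_estimate(bx, by, se)
print(estimate, std_err)
```

Each variant's ratio of outcome effect to exposure effect is an estimate of the causal effect on its own; IVW combines these ratios, giving more weight to variants whose outcome effects are measured more precisely.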
MRdb: a comprehensive database of univariate and multivariate Mendelian randomization with large-scale GWAS summary data. Qian Liu, Yujie Zhang, Houxing Li, Jiatong Li, Mengyu Xin, Rui Sun, Yifan Dai, Xinxin Shan, Yuting He, Borui Xu, Shangwei Ning, Peng Wang, Qiuyan Guo. Database: The Journal of Biological Databases and Curation, 2025. Publication date: 2025-01-18. DOI: 10.1093/database/baaf054. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12462376/pdf/