Pub Date : 2025-07-08DOI: 10.1093/database/baaf041
{"title":"Correction to: GymnoTOA-db: a database and application to optimize functional annotation in gymnosperms.","authors":"","doi":"10.1093/database/baaf041","DOIUrl":"10.1093/database/baaf041","url":null,"abstract":"","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.6,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12560801/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144583336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-03DOI: 10.1093/database/baaf025
Fang-Yi Su, Gia-Han Ngo, Ben Phan, Jung-Hsien Chiang
Biomedical relation extraction often involves datasets with implicit constraints, where structural, syntactic, or semantic rules must be strictly preserved to maintain data integrity. Traditional data augmentation techniques struggle in these scenarios, as they risk violating domain-specific constraints. To address these challenges, we propose CAS (Constrained Augmentation and Semantic-Quality), a novel framework designed for constrained datasets. CAS employs large language models to generate diverse data variations while adhering to predefined rules, and it integrates the SemQ Filter. This self-evaluation mechanism ensures the quality and consistency of augmented data by filtering out noisy or semantically incongruent samples. Although CAS is primarily designed for biomedical relation extraction, its versatile design extends its applicability to tasks with implicit constraints, such as code completion, mathematical reasoning, and information retrieval. Through extensive experiments across multiple domains, CAS demonstrates its ability to enhance model performance by maintaining structural fidelity and semantic accuracy in augmented data. These results highlight the potential of CAS not only in advancing biomedical NLP research but also in addressing data augmentation challenges in diverse constrained-task settings within natural language processing. Database URL: https://github.com/ngogiahan149/CAS.
{"title":"CAS: enhancing implicit constrained data augmentation with semantic enrichment for biomedical relation extraction and beyond.","authors":"Fang-Yi Su, Gia-Han Ngo, Ben Phan, Jung-Hsien Chiang","doi":"10.1093/database/baaf025","DOIUrl":"10.1093/database/baaf025","url":null,"abstract":"<p><p>Biomedical relation extraction often involves datasets with implicit constraints, where structural, syntactic, or semantic rules must be strictly preserved to maintain data integrity. Traditional data augmentation techniques struggle in these scenarios, as they risk violating domain-specific constraints. To address these challenges, we propose CAS (Constrained Augmentation and Semantic-Quality), a novel framework designed for constrained datasets. CAS employs large language models to generate diverse data variations while adhering to predefined rules, and it integrates the SemQ Filter. This self-evaluation mechanism ensures the quality and consistency of augmented data by filtering out noisy or semantically incongruent samples. Although CAS is primarily designed for biomedical relation extraction, its versatile design extends its applicability to tasks with implicit constraints, such as code completion, mathematical reasoning, and information retrieval. Through extensive experiments across multiple domains, CAS demonstrates its ability to enhance model performance by maintaining structural fidelity and semantic accuracy in augmented data. These results highlight the potential of CAS not only in advancing biomedical NLP research but also in addressing data augmentation challenges in diverse constrained-task settings within natural language processing. Database URL: https://github.com/ngogiahan149/CAS.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12224179/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144552558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-05-30DOI: 10.1093/database/baaf027
Muhammad Nabeel Asim, Tayyaba Asif, Faiza Hassan, Andreas Dengel
Protein sequence analysis examines the order of amino acids within protein sequences to unlock diverse types of a wealth of knowledge about biological processes and genetic disorders. It helps in forecasting disease susceptibility by finding unique protein signatures, or biomarkers that are linked to particular disease states. Protein Sequence analysis through wet-lab experiments is expensive, time-consuming and error prone. To facilitate large-scale proteomics sequence analysis, the biological community is striving for utilizing AI competence for transitioning from wet-lab to computer aided applications. However, Proteomics and AI are two distinct fields and development of AI-driven protein sequence analysis applications requires knowledge of both domains. To bridge the gap between both fields, various review articles have been written. However, these articles focus revolves around few individual tasks or specific applications rather than providing a comprehensive overview about wide tasks and applications. Following the need of a comprehensive literature that presents a holistic view of wide array of tasks and applications, contributions of this manuscript are manifold: It bridges the gap between Proteomics and AI fields by presenting a comprehensive array of AI-driven applications for 63 distinct protein sequence analysis tasks. It equips AI researchers by facilitating biological foundations of 63 protein sequence analysis tasks. It enhances development of AI-driven protein sequence analysis applications by providing comprehensive details of 68 protein databases. It presents a rich data landscape, encompassing 627 benchmark datasets of 63 diverse protein sequence analysis tasks. It highlights the utilization of 25 unique word embedding methods and 13 language models in AI-driven protein sequence analysis applications. It accelerates the development of AI-driven applications by facilitating current state-of-the-art performances across 63 protein sequence analysis tasks.
{"title":"Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models.","authors":"Muhammad Nabeel Asim, Tayyaba Asif, Faiza Hassan, Andreas Dengel","doi":"10.1093/database/baaf027","DOIUrl":"10.1093/database/baaf027","url":null,"abstract":"<p><p>Protein sequence analysis examines the order of amino acids within protein sequences to unlock diverse types of a wealth of knowledge about biological processes and genetic disorders. It helps in forecasting disease susceptibility by finding unique protein signatures, or biomarkers that are linked to particular disease states. Protein Sequence analysis through wet-lab experiments is expensive, time-consuming and error prone. To facilitate large-scale proteomics sequence analysis, the biological community is striving for utilizing AI competence for transitioning from wet-lab to computer aided applications. However, Proteomics and AI are two distinct fields and development of AI-driven protein sequence analysis applications requires knowledge of both domains. To bridge the gap between both fields, various review articles have been written. However, these articles focus revolves around few individual tasks or specific applications rather than providing a comprehensive overview about wide tasks and applications. Following the need of a comprehensive literature that presents a holistic view of wide array of tasks and applications, contributions of this manuscript are manifold: It bridges the gap between Proteomics and AI fields by presenting a comprehensive array of AI-driven applications for 63 distinct protein sequence analysis tasks. It equips AI researchers by facilitating biological foundations of 63 protein sequence analysis tasks. It enhances development of AI-driven protein sequence analysis applications by providing comprehensive details of 68 protein databases. It presents a rich data landscape, encompassing 627 benchmark datasets of 63 diverse protein sequence analysis tasks. It highlights the utilization of 25 unique word embedding methods and 13 language models in AI-driven protein sequence analysis applications. It accelerates the development of AI-driven applications by facilitating current state-of-the-art performances across 63 protein sequence analysis tasks.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12125710/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144191613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The paper describes our biomedical relation extraction system, which is designed to participate in the BioCreative VIII challenge Track 1: BioRED Track, which emphasizes the relation extraction from biomedical literature. Our system employs an ensemble learning method, leveraging the PubTator API in conjunction with multiple pretrained bidirectional encoder representations from transformer (BERT) models. Various preprocessing inputs are incorporated, encompassing prompt questions, entity ID pairs, and co-occurrence contexts. To enhance model comprehension, special tokens and boundary tags are incorporated. Specifically, we utilize PubMedBERT alongside the Max Rule ensemble learning mechanism to amalgamate outputs from diverse classifiers. Our findings surpass the established benchmark score, thereby providing a robust benchmark for evaluating performance in this task. Moreover, our study introduces and demonstrates the effectiveness of a data-centric approach, emphasizing the significance of prioritizing high-quality data instances in enhancing model performance and robustness.
{"title":"Enhancing biomedical relation extraction through data-centric and preprocessing-robust ensemble learning approach.","authors":"Wilailack Meesawad, Jen-Chieh Han, Chun-Yu Hsueh, Yu Zhang, Hsi-Chuan Hung, Richard Tzong-Han Tsai","doi":"10.1093/database/baae127","DOIUrl":"10.1093/database/baae127","url":null,"abstract":"<p><p>The paper describes our biomedical relation extraction system, which is designed to participate in the BioCreative VIII challenge Track 1: BioRED Track, which emphasizes the relation extraction from biomedical literature. Our system employs an ensemble learning method, leveraging the PubTator API in conjunction with multiple pretrained bidirectional encoder representations from transformer (BERT) models. Various preprocessing inputs are incorporated, encompassing prompt questions, entity ID pairs, and co-occurrence contexts. To enhance model comprehension, special tokens and boundary tags are incorporated. Specifically, we utilize PubMedBERT alongside the Max Rule ensemble learning mechanism to amalgamate outputs from diverse classifiers. Our findings surpass the established benchmark score, thereby providing a robust benchmark for evaluating performance in this task. Moreover, our study introduces and demonstrates the effectiveness of a data-centric approach, emphasizing the significance of prioritizing high-quality data instances in enhancing model performance and robustness.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12097206/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144126742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-05-20DOI: 10.1093/database/baaf008
Alexander J Kellmann, Sander van den Hoek, Max Postema, W T Kars Maassen, Brenda S Hijmans, Marije A van der Geest, K Joeri van der Velde, Esther J van Enckevort, Morris A Swertz
We previously described Graph2VR, a prototype that enables researchers to use virtual reality (VR) to explore and navigate through Linked Data graphs using SPARQL queries (see https://doi.org/10.1093/database/baae008). Here we evaluate the use of Graph2VR in three realistic life science use cases. The first use case visualizes metadata from large-scale multi-center cohort studies across Europe and Canada via the EUCAN Connect catalogue. The second use case involves a set of genomic data from synthetic rare disease patients, which was processed through the Variant Interpretation Pipeline and then converted into Resource Description Format for visualization. The third use case involves enriching a graph with additional information, in this case, the Dutch Anatomical Therapeutic Chemical code Ontology with the DrugID from Drugbank. These examples collectively showcase Graph2VR's potential for data exploration and enrichment, as well as some of its limitations. We conclude that the endless three-dimensional space provided by VR indeed shows much potential for the navigation of very large knowledge graphs, and we provide recommendations for data preparation and VR tooling moving forward. Database URL: https://doi.org/10.1093/database/baaf008.
{"title":"An exploratory study combining Virtual Reality and Semantic Web for life science research using Graph2VR.","authors":"Alexander J Kellmann, Sander van den Hoek, Max Postema, W T Kars Maassen, Brenda S Hijmans, Marije A van der Geest, K Joeri van der Velde, Esther J van Enckevort, Morris A Swertz","doi":"10.1093/database/baaf008","DOIUrl":"https://doi.org/10.1093/database/baaf008","url":null,"abstract":"<p><p>We previously described Graph2VR, a prototype that enables researchers to use virtual reality (VR) to explore and navigate through Linked Data graphs using SPARQL queries (see https://doi.org/10.1093/database/baae008). Here we evaluate the use of Graph2VR in three realistic life science use cases. The first use case visualizes metadata from large-scale multi-center cohort studies across Europe and Canada via the EUCAN Connect catalogue. The second use case involves a set of genomic data from synthetic rare disease patients, which was processed through the Variant Interpretation Pipeline and then converted into Resource Description Format for visualization. The third use case involves enriching a graph with additional information, in this case, the Dutch Anatomical Therapeutic Chemical code Ontology with the DrugID from Drugbank. These examples collectively showcase Graph2VR's potential for data exploration and enrichment, as well as some of its limitations. We conclude that the endless three-dimensional space provided by VR indeed shows much potential for the navigation of very large knowledge graphs, and we provide recommendations for data preparation and VR tooling moving forward. Database URL: https://doi.org/10.1093/database/baaf008.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144126211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-05-20DOI: 10.1093/database/baaf008
Alexander J Kellmann, Sander van den Hoek, Max Postema, W T Kars Maassen, Brenda S Hijmans, Marije A van der Geest, K Joeri van der Velde, Esther J van Enckevort, Morris A Swertz
We previously described Graph2VR, a prototype that enables researchers to use virtual reality (VR) to explore and navigate through Linked Data graphs using SPARQL queries (see https://doi.org/10.1093/database/baae008). Here we evaluate the use of Graph2VR in three realistic life science use cases. The first use case visualizes metadata from large-scale multi-center cohort studies across Europe and Canada via the EUCAN Connect catalogue. The second use case involves a set of genomic data from synthetic rare disease patients, which was processed through the Variant Interpretation Pipeline and then converted into Resource Description Format for visualization. The third use case involves enriching a graph with additional information, in this case, the Dutch Anatomical Therapeutic Chemical code Ontology with the DrugID from Drugbank. These examples collectively showcase Graph2VR's potential for data exploration and enrichment, as well as some of its limitations. We conclude that the endless three-dimensional space provided by VR indeed shows much potential for the navigation of very large knowledge graphs, and we provide recommendations for data preparation and VR tooling moving forward. Database URL: https://doi.org/10.1093/database/baaf008.
{"title":"An exploratory study combining Virtual Reality and Semantic Web for life science research using Graph2VR.","authors":"Alexander J Kellmann, Sander van den Hoek, Max Postema, W T Kars Maassen, Brenda S Hijmans, Marije A van der Geest, K Joeri van der Velde, Esther J van Enckevort, Morris A Swertz","doi":"10.1093/database/baaf008","DOIUrl":"10.1093/database/baaf008","url":null,"abstract":"<p><p>We previously described Graph2VR, a prototype that enables researchers to use virtual reality (VR) to explore and navigate through Linked Data graphs using SPARQL queries (see https://doi.org/10.1093/database/baae008). Here we evaluate the use of Graph2VR in three realistic life science use cases. The first use case visualizes metadata from large-scale multi-center cohort studies across Europe and Canada via the EUCAN Connect catalogue. The second use case involves a set of genomic data from synthetic rare disease patients, which was processed through the Variant Interpretation Pipeline and then converted into Resource Description Format for visualization. The third use case involves enriching a graph with additional information, in this case, the Dutch Anatomical Therapeutic Chemical code Ontology with the DrugID from Drugbank. These examples collectively showcase Graph2VR's potential for data exploration and enrichment, as well as some of its limitations. We conclude that the endless three-dimensional space provided by VR indeed shows much potential for the navigation of very large knowledge graphs, and we provide recommendations for data preparation and VR tooling moving forward. Database URL: https://doi.org/10.1093/database/baaf008.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12090995/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144110024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Despite the vast amount of sequence data available, a significant disparity exists between the number of protein sequences identified and the relatively few structures that have been resolved. This disparity highlights the challenge in structural biology to bridge the gap between sequence information and 3D structural data, and the necessity for robust databases capable of linking distant homologs to known structures. Studies have indicated that there are a limited number of structural folds, despite the vast diversity of proteins. Hence, computational tools can enhance our ability to classify protein sequences, much before their structures are determined or their functions are characterized, thereby bridging the gap between sequence and structural data. GenDiS (Genomic Distribution of Superfamilies) is a repository with information on the genomic distribution of protein domain superfamilies, involving a one-time computational exercise to search for trusted homologs of protein domains of known structures against the vast sequence database. We have updated this database employing advanced bioinformatics tools, including DELTA-BLAST (domain enhanced lookup time accelerated BLAST) for initial detection of hits and HMMSCAN for validation, significantly improving the accuracy of domain identification. Using these tools, over 151 million sequence homologs for 2060 superfamilies [SCOPe (Structural Classification of Proteins extended)] were identified and 116 million out of them were validated as true positives. Through a case study on glycolysis-related enzymes, variations in domain architectures of these enzymes are explored, revealing evolutionary changes and functional diversity among these essential proteins. We present another case, LOG gene, where one can tune in and find significant mutations across the evolutionary lineage. The GenDiS database, GenDiS3, and the associated tools made available at https://caps.ncbs.res.in/gendis3/ offer a powerful resource for researchers in functional annotation and evolutionary studies. Database URL: https://caps.ncbs.res.in/gendis3/.
尽管有大量可用的序列数据,但已确定的蛋白质序列数量与已解决的相对较少的结构之间存在显着差异。这种差异凸显了结构生物学在序列信息和3D结构数据之间建立桥梁的挑战,以及建立能够将遥远同源物与已知结构联系起来的强大数据库的必要性。研究表明,尽管蛋白质种类繁多,但结构褶皱的数量有限。因此,计算工具可以提高我们对蛋白质序列进行分类的能力,在它们的结构被确定或功能被表征之前,从而弥合了序列和结构数据之间的差距。GenDiS(基因组分布超家族)是蛋白质结构域超家族基因组分布信息的存储库,涉及一次性计算练习,以搜索已知结构的蛋白质结构域的可靠同源物,而不是庞大的序列数据库。我们使用先进的生物信息学工具更新了该数据库,包括用于初始检测命中的DELTA-BLAST(域增强查找时间加速BLAST)和用于验证的HMMSCAN,显着提高了域识别的准确性。使用这些工具,鉴定了2060个超家族[SCOPe (Structural Classification of Proteins extended)]的1.51亿个序列同源物,其中1.16亿个被验证为真阳性。通过对糖酵解相关酶的案例研究,探讨了这些酶结构域结构的变化,揭示了这些必需蛋白质的进化变化和功能多样性。我们提出了另一种情况,LOG基因,在这种情况下,人们可以调谐并发现进化谱系中的重大突变。GenDiS数据库,GenDiS3和相关的工具在https://caps.ncbs.res.in/gendis3/上提供了功能注释和进化研究的研究人员一个强大的资源。数据库地址:https://caps.ncbs.res.in/gendis3/。
{"title":"GenDiS3 database: census on the prevalence of protein domain superfamilies of known structure in the entire sequence database.","authors":"Sarthak Joshi, Shailendu Mohapatra, Dhwani Kumar, Adwait Joshi, Meenakshi Iyer, Ramanathan Sowdhamini","doi":"10.1093/database/baaf035","DOIUrl":"10.1093/database/baaf035","url":null,"abstract":"<p><p>Despite the vast amount of sequence data available, a significant disparity exists between the number of protein sequences identified and the relatively few structures that have been resolved. This disparity highlights the challenge in structural biology to bridge the gap between sequence information and 3D structural data, and the necessity for robust databases capable of linking distant homologs to known structures. Studies have indicated that there are a limited number of structural folds, despite the vast diversity of proteins. Hence, computational tools can enhance our ability to classify protein sequences, much before their structures are determined or their functions are characterized, thereby bridging the gap between sequence and structural data. GenDiS (Genomic Distribution of Superfamilies) is a repository with information on the genomic distribution of protein domain superfamilies, involving a one-time computational exercise to search for trusted homologs of protein domains of known structures against the vast sequence database. We have updated this database employing advanced bioinformatics tools, including DELTA-BLAST (domain enhanced lookup time accelerated BLAST) for initial detection of hits and HMMSCAN for validation, significantly improving the accuracy of domain identification. Using these tools, over 151 million sequence homologs for 2060 superfamilies [SCOPe (Structural Classification of Proteins extended)] were identified and 116 million out of them were validated as true positives. Through a case study on glycolysis-related enzymes, variations in domain architectures of these enzymes are explored, revealing evolutionary changes and functional diversity among these essential proteins. We present another case, LOG gene, where one can tune in and find significant mutations across the evolutionary lineage. The GenDiS database, GenDiS3, and the associated tools made available at https://caps.ncbs.res.in/gendis3/ offer a powerful resource for researchers in functional annotation and evolutionary studies. Database URL: https://caps.ncbs.res.in/gendis3/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12063530/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143978712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Despite the vast amount of sequence data available, a significant disparity exists between the number of protein sequences identified and the relatively few structures that have been resolved. This disparity highlights the challenge in structural biology to bridge the gap between sequence information and 3D structural data, and the necessity for robust databases capable of linking distant homologs to known structures. Studies have indicated that there are a limited number of structural folds, despite the vast diversity of proteins. Hence, computational tools can enhance our ability to classify protein sequences, much before their structures are determined or their functions are characterized, thereby bridging the gap between sequence and structural data. GenDiS (Genomic Distribution of Superfamilies) is a repository with information on the genomic distribution of protein domain superfamilies, involving a one-time computational exercise to search for trusted homologs of protein domains of known structures against the vast sequence database. We have updated this database employing advanced bioinformatics tools, including DELTA-BLAST (domain enhanced lookup time accelerated BLAST) for initial detection of hits and HMMSCAN for validation, significantly improving the accuracy of domain identification. Using these tools, over 151 million sequence homologs for 2060 superfamilies [SCOPe (Structural Classification of Proteins extended)] were identified and 116 million out of them were validated as true positives. Through a case study on glycolysis-related enzymes, variations in domain architectures of these enzymes are explored, revealing evolutionary changes and functional diversity among these essential proteins. We present another case, LOG gene, where one can tune in and find significant mutations across the evolutionary lineage. The GenDiS database, GenDiS3, and the associated tools made available at https://caps.ncbs.res.in/gendis3/ offer a powerful resource for researchers in functional annotation and evolutionary studies. Database URL: https://caps.ncbs.res.in/gendis3/.
尽管有大量可用的序列数据,但已确定的蛋白质序列数量与已解决的相对较少的结构之间存在显着差异。这种差异凸显了结构生物学在序列信息和3D结构数据之间建立桥梁的挑战,以及建立能够将遥远同源物与已知结构联系起来的强大数据库的必要性。研究表明,尽管蛋白质种类繁多,但结构褶皱的数量有限。因此,计算工具可以提高我们对蛋白质序列进行分类的能力,在它们的结构被确定或功能被表征之前,从而弥合了序列和结构数据之间的差距。GenDiS(基因组分布超家族)是蛋白质结构域超家族基因组分布信息的存储库,涉及一次性计算练习,以搜索已知结构的蛋白质结构域的可靠同源物,而不是庞大的序列数据库。我们使用先进的生物信息学工具更新了该数据库,包括用于初始检测命中的DELTA-BLAST(域增强查找时间加速BLAST)和用于验证的HMMSCAN,显着提高了域识别的准确性。使用这些工具,鉴定了2060个超家族[SCOPe (Structural Classification of Proteins extended)]的1.51亿个序列同源物,其中1.16亿个被验证为真阳性。通过对糖酵解相关酶的案例研究,探讨了这些酶结构域结构的变化,揭示了这些必需蛋白质的进化变化和功能多样性。我们提出了另一种情况,LOG基因,在这种情况下,人们可以调谐并发现进化谱系中的重大突变。GenDiS数据库,GenDiS3和相关的工具在https://caps.ncbs.res.in/gendis3/上提供了功能注释和进化研究的研究人员一个强大的资源。数据库地址:https://caps.ncbs.res.in/gendis3/。
{"title":"GenDiS3 database: census on the prevalence of protein domain superfamilies of known structure in the entire sequence database.","authors":"Sarthak Joshi, Shailendu Mohapatra, Dhwani Kumar, Adwait Joshi, Meenakshi Iyer, Ramanathan Sowdhamini","doi":"10.1093/database/baaf035","DOIUrl":"https://doi.org/10.1093/database/baaf035","url":null,"abstract":"<p><p>Despite the vast amount of sequence data available, a significant disparity exists between the number of protein sequences identified and the relatively few structures that have been resolved. This disparity highlights the challenge in structural biology to bridge the gap between sequence information and 3D structural data, and the necessity for robust databases capable of linking distant homologs to known structures. Studies have indicated that there are a limited number of structural folds, despite the vast diversity of proteins. Hence, computational tools can enhance our ability to classify protein sequences, much before their structures are determined or their functions are characterized, thereby bridging the gap between sequence and structural data. GenDiS (Genomic Distribution of Superfamilies) is a repository with information on the genomic distribution of protein domain superfamilies, involving a one-time computational exercise to search for trusted homologs of protein domains of known structures against the vast sequence database. We have updated this database employing advanced bioinformatics tools, including DELTA-BLAST (domain enhanced lookup time accelerated BLAST) for initial detection of hits and HMMSCAN for validation, significantly improving the accuracy of domain identification. Using these tools, over 151 million sequence homologs for 2060 superfamilies [SCOPe (Structural Classification of Proteins extended)] were identified and 116 million out of them were validated as true positives. Through a case study on glycolysis-related enzymes, variations in domain architectures of these enzymes are explored, revealing evolutionary changes and functional diversity among these essential proteins. We present another case, LOG gene, where one can tune in and find significant mutations across the evolutionary lineage. The GenDiS database, GenDiS3, and the associated tools made available at https://caps.ncbs.res.in/gendis3/ offer a powerful resource for researchers in functional annotation and evolutionary studies. Database URL: https://caps.ncbs.res.in/gendis3/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144126776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-05-07DOI: 10.1093/database/baaf030
Milind Chauhan, Amisha Gupta, Ritu Tomer, Gajendra P S Raghava
CancerPPD2 (http://webs.iiitd.edu.in/raghava/cancerppd2/) is an updated version of CancerPPD, developed to maintain comprehensive information about anticancer peptides and proteins. It contains 6521 entries, each entry provides detailed information about an anticancer peptide/protein that include origin of the peptide, cancer cell line, type of cancer, peptide sequence, and structure. These anticancer peptides have been tested against 392 types of cancer cell lines and 28 types of cancer-associated tissues. In addition to natural anticancer peptides, CancerPPD2 contains 781 entries for chemically modified and 3018 entries for N-/C- terminus modified anticancer peptides. Few entries are also linked with 47 clinical studies and have provided the cross reference to Uniprot, DrugBank, and ThPDB2. The possible entries also linked with clinical trials. On average, CancerPPD2 contains around 85% more information than its previous version, CancerPPD. The structures of these anticancer peptides and proteins were either obtained from the Protein Data Bank (PDB) or predicted using PEPstrMOD, I-TASSER, and AlphaFold. A wide range of tools have been integrated into CancerPPD2 for data retrieval and similarity searches. Additionally, we integrated a REST API into this repository to facilitate automatic data retrieval via program. Database URL: https://webs.iiitd.edu.in/raghava/cancerppd2/api/rest.html.
{"title":"CancerPPD2: an updated repository of anticancer peptides and proteins.","authors":"Milind Chauhan, Amisha Gupta, Ritu Tomer, Gajendra P S Raghava","doi":"10.1093/database/baaf030","DOIUrl":"https://doi.org/10.1093/database/baaf030","url":null,"abstract":"<p><p>CancerPPD2 (http://webs.iiitd.edu.in/raghava/cancerppd2/) is an updated version of CancerPPD, developed to maintain comprehensive information about anticancer peptides and proteins. It contains 6521 entries, each entry provides detailed information about an anticancer peptide/protein that include origin of the peptide, cancer cell line, type of cancer, peptide sequence, and structure. These anticancer peptides have been tested against 392 types of cancer cell lines and 28 types of cancer-associated tissues. In addition to natural anticancer peptides, CancerPPD2 contains 781 entries for chemically modified and 3018 entries for N-/C- terminus modified anticancer peptides. Few entries are also linked with 47 clinical studies and have provided the cross reference to Uniprot, DrugBank, and ThPDB2. The possible entries also linked with clinical trials. On average, CancerPPD2 contains around 85% more information than its previous version, CancerPPD. The structures of these anticancer peptides and proteins were either obtained from the Protein Data Bank (PDB) or predicted using PEPstrMOD, I-TASSER, and AlphaFold. A wide range of tools have been integrated into CancerPPD2 for data retrieval and similarity searches. Additionally, we integrated a REST API into this repository to facilitate automatic data retrieval via program. Database URL: https://webs.iiitd.edu.in/raghava/cancerppd2/api/rest.html.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144126525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-05-07DOI: 10.1093/database/baaf036
An Phan, Parnal Joshi, Claus Kadelka, Iddo Friedberg
The resources required to study gene function are limited, especially when considering the number of genes in the human genome and the complexity of their function. Therefore, genes are prioritized for experimental studies based on many different considerations, including, but not limited to, perceived biomedical importance, such as disease-associated genes, or the understanding of biological processes, such as cell signalling pathways. At the same time, most genes are not studied or are under-characterized, which hampers our understanding of their function and potential effects on human health and wellness. Understanding function annotation disparity is a necessary first step toward understanding how much functional knowledge is gained from the human genome, and toward guidelines for better targeting future studies of the genes in the human genome effectively. Here, we present a comprehensive longitudinal analysis of the human proteome utilizing data analysis tools from economics and information theory. Specifically, we view the human proteome as a population of proteins within a knowledge economy: we treat the quantified knowledge of the protein's function as the analogue of wealth and examine the distribution of information in a population of proteins in the proteome in the same manner distribution of wealth is studied in societies. Our results show a highly skewed distribution of information about human proteins over the last decade, in which the inequality in the annotations given to the proteins remains high. Additionally, we examine the correlation between the knowledge about protein function as captured in databases and the interest in proteins as reflected by mentions in the scientific literature. We show a large gap between knowledge and interest and dissect the factors leading to this gap. In conclusion, our study shows that research efforts should be redirected to less studied proteins to mitigate the disparity among human proteins both in databases and literature.
{"title":"A longitudinal analysis of function annotations of the human proteome reveals consistently high biases.","authors":"An Phan, Parnal Joshi, Claus Kadelka, Iddo Friedberg","doi":"10.1093/database/baaf036","DOIUrl":"https://doi.org/10.1093/database/baaf036","url":null,"abstract":"<p><p>The resources required to study gene function are limited, especially when considering the number of genes in the human genome and the complexity of their function. Therefore, genes are prioritized for experimental studies based on many different considerations, including, but not limited to, perceived biomedical importance, such as disease-associated genes, or the understanding of biological processes, such as cell signalling pathways. At the same time, most genes are not studied or are under-characterized, which hampers our understanding of their function and potential effects on human health and wellness. Understanding function annotation disparity is a necessary first step toward understanding how much functional knowledge is gained from the human genome, and toward guidelines for better targeting future studies of the genes in the human genome effectively. Here, we present a comprehensive longitudinal analysis of the human proteome utilizing data analysis tools from economics and information theory. Specifically, we view the human proteome as a population of proteins within a knowledge economy: we treat the quantified knowledge of the protein's function as the analogue of wealth and examine the distribution of information in a population of proteins in the proteome in the same manner distribution of wealth is studied in societies. Our results show a highly skewed distribution of information about human proteins over the last decade, in which the inequality in the annotations given to the proteins remains high. Additionally, we examine the correlation between the knowledge about protein function as captured in databases and the interest in proteins as reflected by mentions in the scientific literature. We show a large gap between knowledge and interest and dissect the factors leading to this gap. In conclusion, our study shows that research efforts should be redirected to less studied proteins to mitigate the disparity among human proteins both in databases and literature.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144126981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}