Pub Date: 2024-08-05 DOI: 10.1093/database/baae066
Sumit Madan, Lisa Kühnel, Holger Fröhlich, Martin Hofmann-Apitius, Juliane Fluck
MicroRNAs (miRNAs) play important roles in post-transcriptional processes and regulate major cellular functions. Abnormal regulation of miRNA expression has been linked to numerous human diseases, such as respiratory diseases, cancer, and neurodegenerative diseases. The latest miRNA-disease associations are predominantly found in unstructured biomedical literature. Retrieving these associations manually is cumbersome and time-consuming because the number of publications keeps growing. We propose a deep learning-based text mining approach that extracts normalized miRNA-disease associations from biomedical literature. To train the deep learning models, we build a new training corpus that is extended by distant supervision using multiple external databases. A quantitative evaluation shows that the workflow achieves an area under the receiver operating characteristic curve of 98% on a holdout test set for the detection of miRNA-disease associations. We demonstrate the applicability of the approach by extracting new miRNA-disease associations from biomedical literature (PubMed and PubMed Central). Quantitative analysis and evaluation on three different neurodegenerative diseases show that our approach can effectively extract miRNA-disease associations not yet available in public databases. Database URL: https://zenodo.org/records/10523046.
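The area under the receiver operating characteristic curve reported above can be read as the probability that a randomly chosen true association receives a higher score than a randomly chosen non-association. The following is a minimal sketch of that computation, not the authors' evaluation code; the helper name `roc_auc` and the labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """Pairwise (rank-based) AUC: fraction of positive/negative pairs
    in which the positive is scored higher; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative data only: 1 = sentence states an association, 0 = it does not.
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.4, 0.5, 0.1]
example_auc = roc_auc(example_labels, example_scores)
```

The quadratic pairwise loop is chosen for readability; for large test sets a sort-based implementation from an established library would be the more practical choice.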
{"title":"Dataset of miRNA-disease relations extracted from textual data using transformer-based neural networks.","authors":"Sumit Madan, Lisa Kühnel, Holger Fröhlich, Martin Hofmann-Apitius, Juliane Fluck","doi":"10.1093/database/baae066","DOIUrl":"10.1093/database/baae066","url":null,"abstract":"<p><p>MicroRNAs (miRNAs) play important roles in post-transcriptional processes and regulate major cellular functions. The abnormal regulation of expression of miRNAs has been linked to numerous human diseases such as respiratory diseases, cancer, and neurodegenerative diseases. Latest miRNA-disease associations are predominantly found in unstructured biomedical literature. Retrieving these associations manually can be cumbersome and time-consuming due to the continuously expanding number of publications. We propose a deep learning-based text mining approach that extracts normalized miRNA-disease associations from biomedical literature. To train the deep learning models, we build a new training corpus that is extended by distant supervision utilizing multiple external databases. A quantitative evaluation shows that the workflow achieves an area under receiver operator characteristic curve of 98% on a holdout test set for the detection of miRNA-disease associations. We demonstrate the applicability of the approach by extracting new miRNA-disease associations from biomedical literature (PubMed and PubMed Central). We have shown through quantitative analysis and evaluation on three different neurodegenerative diseases that our approach can effectively extract miRNA-disease associations not yet available in public databases. 
Database URL: https://zenodo.org/records/10523046.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11300841/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141893078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-05 DOI: 10.1093/database/baae074
Chunhui Xu, Trey Shaw, Sai Akhil Choppararu, Yiwei Lu, Shaik Naveed Farooq, Yongfang Qin, Matt Hudson, Brock Weekley, Michael Fisher, Fei He, Jose Roberto Da Silva Nascimento, Nicholas Wergeles, Trupti Joshi, Philip D Bates, Abraham J Koo, Doug K Allen, Edgar B Cahoon, Jay J Thelen, Dong Xu
FatPlants, an open-access, web-based database, consolidates data, annotations, analysis results, and visualizations of lipid-related genes, proteins, and metabolic pathways in plants. Serving as a minable resource, FatPlants offers a user-friendly interface that facilitates studies of the regulation of plant lipid metabolism and supports breeding efforts aimed at increasing crop oil content. This web resource, developed using data derived from our own research, curated from public resources, and gleaned from academic literature, comprises information on known fatty-acid-related proteins, genes, and pathways in multiple plants, with an emphasis on Glycine max, Arabidopsis thaliana, and Camelina sativa. Furthermore, the platform includes machine-learning-based methods and navigation tools designed to aid in characterizing metabolic pathways and protein interactions. Additional features include comprehensive gene and protein information cards, a Basic Local Alignment Search Tool (BLAST) search function, similar-structure search based on AlphaFold, and ChatGPT-based queries for protein information. Database URL: https://www.fatplants.net/.
{"title":"FatPlants: a comprehensive information system for lipid-related genes and metabolic pathways in plants.","authors":"Chunhui Xu, Trey Shaw, Sai Akhil Choppararu, Yiwei Lu, Shaik Naveed Farooq, Yongfang Qin, Matt Hudson, Brock Weekley, Michael Fisher, Fei He, Jose Roberto Da Silva Nascimento, Nicholas Wergeles, Trupti Joshi, Philip D Bates, Abraham J Koo, Doug K Allen, Edgar B Cahoon, Jay J Thelen, Dong Xu","doi":"10.1093/database/baae074","DOIUrl":"10.1093/database/baae074","url":null,"abstract":"<p><p>FatPlants, an open-access, web-based database, consolidates data, annotations, analysis results, and visualizations of lipid-related genes, proteins, and metabolic pathways in plants. Serving as a minable resource, FatPlants offers a user-friendly interface for facilitating studies into the regulation of plant lipid metabolism and supporting breeding efforts aimed at increasing crop oil content. This web resource, developed using data derived from our own research, curated from public resources, and gleaned from academic literature, comprises information on known fatty-acid-related proteins, genes, and pathways in multiple plants, with an emphasis on Glycine max, Arabidopsis thaliana, and Camelina sativa. Furthermore, the platform includes machine-learning based methods and navigation tools designed to aid in characterizing metabolic pathways and protein interactions. Comprehensive gene and protein information cards, a Basic Local Alignment Search Tool search function, similar structure search capacities from AphaFold, and ChatGPT-based query for protein information are additional features. 
Database URL: https://www.fatplants.net/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11300840/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141893079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-07-30 DOI: 10.1093/database/baae068
Richard A A Jonker, Tiago Almeida, Rui Antunes, João R Almeida, Sérgio Matos
The identification of medical concepts in clinical narratives is of great interest to the biomedical scientific community because of its importance for treatment improvement and drug development research. Biomedical named entity recognition (NER) in clinical texts is crucial for automated information extraction, facilitating patient record analysis, drug development, and medical research. Traditional approaches often focus on single-class NER tasks, yet recent advancements emphasize the necessity of addressing multi-class scenarios, particularly in complex biomedical domains. This paper proposes a strategy to integrate a multi-head conditional random field (CRF) classifier for multi-class NER in Spanish clinical documents. By using a multi-head CRF model, our methodology handles overlapping entity instances of different types, a common challenge for traditional NER methodologies. This architecture enhances computational efficiency and ensures scalability for multi-class NER tasks while maintaining high performance. By combining four diverse datasets, SympTEMIST, MedProcNER, DisTEMIST, and PharmaCoNER, we expand the scope of NER to encompass five classes: symptoms, procedures, diseases, chemicals, and proteins. To the best of our knowledge, these datasets combined create the largest Spanish multi-class dataset focusing on biomedical entity recognition and linking for clinical notes, which is important for training a biomedical model in Spanish. We also provide entity linking to the multi-lingual Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) vocabulary, with the eventual goal of performing biomedical relation extraction. Through experimentation and evaluation on Spanish clinical documents, our strategy provides competitive results against single-class NER models. For NER, our system achieves a combined micro-averaged F1-score of 78.73, with clinical mentions normalized to SNOMED CT with an end-to-end F1-score of 54.51. 
The code to run our system is publicly available at https://github.com/ieeta-pt/Multi-Head-CRF. Database URL: https://github.com/ieeta-pt/Multi-Head-CRF.
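A micro-averaged F1-score, as reported above, pools true positives, false positives, and false negatives over all entity classes before computing a single score, so frequent classes weigh more than rare ones. A minimal sketch under that definition follows; the function name `micro_f1` and the per-class counts are hypothetical and are not results from the paper.

```python
def micro_f1(per_class_counts):
    """Micro-averaged F1: sum tp/fp/fn over classes, then compute
    F1 = 2*tp / (2*tp + fp + fn) on the pooled counts."""
    tp = sum(c["tp"] for c in per_class_counts.values())
    fp = sum(c["fp"] for c in per_class_counts.values())
    fn = sum(c["fn"] for c in per_class_counts.values())
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical counts for two of the five entity classes.
example_counts = {
    "symptoms": {"tp": 8, "fp": 2, "fn": 2},
    "procedures": {"tp": 4, "fp": 1, "fn": 3},
}
example_micro_f1 = micro_f1(example_counts)
```

With these made-up counts the pooled totals are tp = 12, fp = 3, fn = 5, giving 24 / 32 = 0.75. A macro average would instead compute F1 per class and average the results, which treats every class equally regardless of size.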
{"title":"Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes.","authors":"Richard A A Jonker, Tiago Almeida, Rui Antunes, João R Almeida, Sérgio Matos","doi":"10.1093/database/baae068","DOIUrl":"10.1093/database/baae068","url":null,"abstract":"<p><p>The identification of medical concepts from clinical narratives has a large interest in the biomedical scientific community due to its importance in treatment improvements or drug development research. Biomedical named entity recognition (NER) in clinical texts is crucial for automated information extraction, facilitating patient record analysis, drug development, and medical research. Traditional approaches often focus on single-class NER tasks, yet recent advancements emphasize the necessity of addressing multi-class scenarios, particularly in complex biomedical domains. This paper proposes a strategy to integrate a multi-head conditional random field (CRF) classifier for multi-class NER in Spanish clinical documents. Our methodology overcomes overlapping entity instances of different types, a common challenge in traditional NER methodologies, by using a multi-head CRF model. This architecture enhances computational efficiency and ensures scalability for multi-class NER tasks, maintaining high performance. By combining four diverse datasets, SympTEMIST, MedProcNER, DisTEMIST, and PharmaCoNER, we expand the scope of NER to encompass five classes: symptoms, procedures, diseases, chemicals, and proteins. To the best of our knowledge, these datasets combined create the largest Spanish multi-class dataset focusing on biomedical entity recognition and linking for clinical notes, which is important to train a biomedical model in Spanish. We also provide entity linking to the multi-lingual Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) vocabulary, with the eventual goal of performing biomedical relation extraction. 
Through experimentation and evaluation of Spanish clinical documents, our strategy provides competitive results against single-class NER models. For NER, our system achieves a combined micro-averaged F1-score of 78.73, with clinical mentions normalized to SNOMED CT with an end-to-end F1-score of 54.51. The code to run our system is publicly available at https://github.com/ieeta-pt/Multi-Head-CRF. Database URL: https://github.com/ieeta-pt/Multi-Head-CRF.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11290360/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141859304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large amounts of important medical information are captured in free-text documents in biomedical research and within healthcare systems, which can be made accessible through natural language processing (NLP). A key component in most biomedical NLP pipelines is entity linking, i.e. grounding textual mentions of named entities to a reference of medical concepts, usually derived from a terminology system, such as the Systematized Nomenclature of Medicine Clinical Terms. However, complex entity mentions, spanning multiple tokens, are notoriously hard to normalize due to the difficulty of finding appropriate candidate concepts. In this work, we propose an approach to preprocess such mentions for candidate generation, building upon recent advances in text simplification with generative large language models. We evaluate the feasibility of our method in the context of the entity linking track of the BioCreative VIII SympTEMIST shared task. We find that instructing the latest Generative Pre-trained Transformer model with a few-shot prompt for text simplification results in mention spans that are easier to normalize. Thus, we can improve recall during candidate generation by 2.9 percentage points compared to our baseline system, which achieved the best score in the original shared task evaluation. Furthermore, we show that this improvement in recall can be fully translated into top-1 accuracy through careful initialization of a subsequent reranking model. Our best system achieves an accuracy of 63.6% on the SympTEMIST test set. The proposed approach has been integrated into the open-source xMEN toolkit, which is available online via https://github.com/hpi-dhc/xmen.
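Recall during candidate generation, as discussed above, measures how often the correct concept appears anywhere in the top-k candidate list, which bounds the accuracy a later reranker can reach. The sketch below illustrates the metric only; it is not the xMEN API, and the function name `recall_at_k` and the concept identifiers are placeholders.

```python
def recall_at_k(candidate_lists, gold_concepts, k):
    """Fraction of mentions whose gold concept is among the first k
    candidates generated for that mention."""
    if len(candidate_lists) != len(gold_concepts):
        raise ValueError("one candidate list is required per gold concept")
    if not gold_concepts:
        return 0.0
    hits = sum(
        1
        for candidates, gold in zip(candidate_lists, gold_concepts)
        if gold in candidates[:k]
    )
    return hits / len(gold_concepts)


# Placeholder concept identifiers; each inner list is ranked best-first.
example_candidates = [["C1", "C2"], ["C3", "C4"], ["C5", "C6"]]
example_gold = ["C2", "C9", "C5"]
recall_top1 = recall_at_k(example_candidates, example_gold, k=1)
recall_top2 = recall_at_k(example_candidates, example_gold, k=2)
```

In this toy example only the third mention is correct at rank 1, while the first mention is recovered at rank 2. The second mention can never be linked correctly because its gold concept is missing from the candidates, which is the kind of miss that simplifying the mention text is meant to reduce.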
{"title":"Improving biomedical entity linking for complex entity mentions with LLM-based text simplification.","authors":"Florian Borchert, Ignacio Llorca, Matthieu-P Schapranow","doi":"10.1093/database/baae067","DOIUrl":"10.1093/database/baae067","url":null,"abstract":"<p><p>Large amounts of important medical information are captured in free-text documents in biomedical research and within healthcare systems, which can be made accessible through natural language processing (NLP). A key component in most biomedical NLP pipelines is entity linking, i.e. grounding textual mentions of named entities to a reference of medical concepts, usually derived from a terminology system, such as the Systematized Nomenclature of Medicine Clinical Terms. However, complex entity mentions, spanning multiple tokens, are notoriously hard to normalize due to the difficulty of finding appropriate candidate concepts. In this work, we propose an approach to preprocess such mentions for candidate generation, building upon recent advances in text simplification with generative large language models. We evaluate the feasibility of our method in the context of the entity linking track of the BioCreative VIII SympTEMIST shared task. We find that instructing the latest Generative Pre-trained Transformer model with a few-shot prompt for text simplification results in mention spans that are easier to normalize. Thus, we can improve recall during candidate generation by 2.9 percentage points compared to our baseline system, which achieved the best score in the original shared task evaluation. Furthermore, we show that this improvement in recall can be fully translated into top-1 accuracy through careful initialization of a subsequent reranking model. Our best system achieves an accuracy of 63.6% on the SympTEMIST test set. 
The proposed approach has been integrated into the open-source xMEN toolkit, which is available online via https://github.com/hpi-dhc/xmen.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11281847/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141765753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biological databases serve as a critical foundation for modern research, and amid the dynamic landscape of biology, the COVID-19 database has emerged as an indispensable resource. The global outbreak of COVID-19, which began in December 2019, necessitates comprehensive databases to unravel the intricate connections between this novel virus and cancer. Despite existing databases, a crucial need persists for a centralized and accessible method to acquire precise information within the research community. The main aim of this work is to develop a database that makes all COVID-19-related data available in one click, with automatic global notifications. This gap is addressed by the COVID-19 Pandemic Database (CO-19 PDB 2.0), positioned as a comprehensive resource for researchers navigating the complexities of COVID-19 and cancer. Between December 2019 and June 2024, CO-19 PDB 2.0 systematically collected and organized 120 datasets into six distinct categories, each catering to specific functionalities. These categories encompass a chemical structure database, a digital image database, a visualization tool database, a genomic database, a social science database, and a literature database. Functionalities range from image analysis and gene sequence information to data visualization and updates on environmental events. CO-19 PDB 2.0 lets users choose either the database search page or the auto-notification page, providing seamless retrieval of information. The dedicated page introduces six predefined charts, providing insights into crucial criteria such as 'number of cases and deaths', 'country-wise distribution', 'new cases and recovery', and 'rates of death and recovery'. The global impact of COVID-19 on cancer patients has led to extensive collaboration among research institutions, producing numerous articles and computational studies published in international journals. 
A key feature of this initiative is automatic daily notifications for standardized information updates. Users can easily navigate by category or use a direct search option. The study offers up-to-date COVID-19 datasets and global statistics on COVID-19 and cancer, highlighting the top 10 cancers diagnosed in the USA in 2022. Breast and prostate cancers are the most common, representing 30% and 26% of new cases, respectively. The initiative also ensures the removal or replacement of dead links, providing a valuable resource for researchers, healthcare professionals, and individuals. The database has been implemented in PHP, HTML, CSS and MySQL and is freely available at https://www.co-19pdb.habdsk.org/. Database URL: https://www.co-19pdb.habdsk.org/.
{"title":"CO-19 PDB 2.0: A Comprehensive COVID-19 Database with Global Auto-Alerts, Statistical Analysis, and Cancer Correlations.","authors":"Shahid Ullah, Yingmei Li, Wajeeha Rahman, Farhan Ullah, Muhammad Ijaz, Anees Ullah, Gulzar Ahmad, Hameed Ullah, Tianshun Gao","doi":"10.1093/database/baae072","DOIUrl":"10.1093/database/baae072","url":null,"abstract":"<p><p>Biological databases serve as critical basics for modern research, and amid the dynamic landscape of biology, the COVID-19 database has emerged as an indispensable resource. The global outbreak of Covid-19, commencing in December 2019, necessitates comprehensive databases to unravel the intricate connections between this novel virus and cancer. Despite existing databases, a crucial need persists for a centralized and accessible method to acquire precise information within the research community. The main aim of the work is to develop a database which has all the COVID-19-related data available in just one click with auto global notifications. This gap is addressed by the meticulously designed COVID-19 Pandemic Database (CO-19 PDB 2.0), positioned as a comprehensive resource for researchers navigating the complexities of COVID-19 and cancer. Between December 2019 and June 2024, the CO-19 PDB 2.0 systematically collected and organized 120 datasets into six distinct categories, each catering to specific functionalities. These categories encompass a chemical structure database, a digital image database, a visualization tool database, a genomic database, a social science database, and a literature database. Functionalities range from image analysis and gene sequence information to data visualization and updates on environmental events. CO-19 PDB 2.0 has the option to choose either the search page for the database or the autonotification page, providing a seamless retrieval of information. 
The dedicated page introduces six predefined charts, providing insights into crucial criteria such as the number of cases and deaths', country-wise distribution, 'new cases and recovery', and rates of death and recovery. The global impact of COVID-19 on cancer patients has led to extensive collaboration among research institutions, producing numerous articles and computational studies published in international journals. A key feature of this initiative is auto daily notifications for standardized information updates. Users can easily navigate based on different categories or use a direct search option. The study offers up-to-date COVID-19 datasets and global statistics on COVID-19 and cancer, highlighting the top 10 cancers diagnosed in the USA in 2022. Breast and prostate cancers are the most common, representing 30% and 26% of new cases, respectively. The initiative also ensures the removal or replacement of dead links, providing a valuable resource for researchers, healthcare professionals, and individuals. The database has been implemented in PHP, HTML, CSS and MySQL and is available freely at https://www.co-19pdb.habdsk.org/. Database URL: https://www.co-19pdb.habdsk.org/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11281848/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141765713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-07-25 DOI: 10.1093/database/baae060
Enio Gjerga, Matthias Dewenter, Thiago Britto-Borges, Johannes Grosso, Frank Stein, Jessica Eschenbach, Mandy Rettel, Johannes Backs, Christoph Dieterich
Time-course multi-omics data of a murine model of progressive heart failure (HF) induced by transverse aortic constriction (TAC) provide insights into the molecular mechanisms that are causally involved in contractile failure and structural cardiac remodelling. We employ Illumina-based transcriptomics, Nanopore sequencing and mass spectrometry-based proteomics on samples from the left ventricle (LV) and right ventricle (RV, RNA only) of the heart at 1, 7, 21 and 56 days following TAC and Sham surgery. Here, we present Transverse Aortic COnstriction Multi-omics Analysis (TACOMA), an interactive web application that integrates and visualizes transcriptomics and proteomics data collected in a TAC time-course experiment. TACOMA enables users to visualize the expression profiles of known and novel genes and their protein products. Importantly, we also capture alternative splicing events by assessing differential transcript and exon usage. Co-expression-based clustering algorithms and functional enrichment analysis revealed overrepresented annotations of biological processes and molecular functions at the protein and gene levels. To enhance data integration, TACOMA synchronizes transcriptomics and proteomics profiles, enabling cross-omics comparisons. With TACOMA (https://shiny.dieterichlab.org/app/tacoma), we offer a rich web-based resource to uncover molecular events and biological processes implicated in contractile failure and cardiac hypertrophy. For example, we highlight: (i) changes in metabolic genes and proteins in the time course of hypertrophic growth and contractile impairment; (ii) identification of RNA splicing changes in the expression of Tpm2 isoforms between RV and LV; and (iii) novel transcripts and genes likely contributing to the pathogenesis of HF. We plan to extend these data with additional environmental and genetic models of HF to decipher common and distinct molecular changes in heart diseases of different aetiologies. 
Database URL: https://shiny.dieterichlab.org/app/tacoma.
{"title":"Transverse aortic constriction multi-omics analysis uncovers pathophysiological cardiac molecular mechanisms.","authors":"Enio Gjerga, Matthias Dewenter, Thiago Britto-Borges, Johannes Grosso, Frank Stein, Jessica Eschenbach, Mandy Rettel, Johannes Backs, Christoph Dieterich","doi":"10.1093/database/baae060","DOIUrl":"10.1093/database/baae060","url":null,"abstract":"<p><p>Time-course multi-omics data of a murine model of progressive heart failure (HF) induced by transverse aortic constriction (TAC) provide insights into the molecular mechanisms that are causatively involved in contractile failure and structural cardiac remodelling. We employ Illumina-based transcriptomics, Nanopore sequencing and mass spectrometry-based proteomics on samples from the left ventricle (LV) and right ventricle (RV, RNA only) of the heart at 1, 7, 21 and 56 days following TAC and Sham surgery. Here, we present Transverse Aortic COnstriction Multi-omics Analysis (TACOMA), as an interactive web application that integrates and visualizes transcriptomics and proteomics data collected in a TAC time-course experiment. TACOMA enables users to visualize the expression profile of known and novel genes and protein products thereof. Importantly, we capture alternative splicing events by assessing differential transcript and exon usage as well. Co-expression-based clustering algorithms and functional enrichment analysis revealed overrepresented annotations of biological processes and molecular functions at the protein and gene levels. To enhance data integration, TACOMA synchronizes transcriptomics and proteomics profiles, enabling cross-omics comparisons. With TACOMA (https://shiny.dieterichlab.org/app/tacoma), we offer a rich web-based resource to uncover molecular events and biological processes implicated in contractile failure and cardiac hypertrophy. 
For example, we highlight: (i) changes in metabolic genes and proteins in the time course of hypertrophic growth and contractile impairment; (ii) identification of RNA splicing changes in the expression of Tpm2 isoforms between RV and LV; and (iii) novel transcripts and genes likely contributing to the pathogenesis of HF. We plan to extend these data with additional environmental and genetic models of HF to decipher common and distinct molecular changes in heart diseases of different aetiologies. Database URL: https://shiny.dieterichlab.org/app/tacoma.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11270014/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141757630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-07-24 DOI: 10.1093/database/baae063
Laura Krumpholz, Aleksandra Klimczyk, Wiktoria Bieniek, Sebastian Polak, Barbara Wiśniowska
In vitro-in vivo extrapolation is a commonly applied technique for liver clearance prediction. Various in vitro models are available, such as hepatocytes, human liver microsomes, or recombinant cytochromes P450. According to the free drug theory, only the unbound fraction (fu) of a chemical can undergo metabolic changes. Therefore, to ensure the reliability of predictions, both specific and nonspecific binding in the model should be accounted for. However, the fraction unbound in the experiment is often not reported. The study aimed to provide a detailed repository of literature data on compounds' fu values in the various in vitro systems used for drug metabolism evaluation, together with the corresponding human plasma binding levels. Data on the free fraction in plasma and in the different in vitro models were supplemented with the following information: the experimental method used to assess the degree of drug binding, the protein or cell concentration in the incubation, other experimental conditions if different from the standard ones, species, a reference to the source publication, and the author's name and date of publication. In total, we collected 129 literature studies on 1425 different compounds. The provided data set can be used as a reference for scientists involved in pharmacokinetic/physiologically based pharmacokinetic modelling as well as researchers interested in Quantitative Structure-Activity Relationship models for the prediction of fraction unbound based on compound structure. Database URL: https://data.mendeley.com/datasets/3bs5526htd/1.
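To illustrate why the fraction unbound in the incubation matters, the sketch below applies the standard binding correction (apparent intrinsic clearance divided by fu in the incubation) and then, as an assumption not stated in the abstract, the well-stirred liver model to estimate hepatic clearance. The function names, parameter names, and numeric values are hypothetical, and the intrinsic clearance is assumed to be already scaled to the whole liver in the same units as hepatic blood flow.

```python
def unbound_intrinsic_clearance(cl_int_apparent, fu_incubation):
    """Correct apparent intrinsic clearance for binding in the in vitro
    incubation: CLint,u = CLint,app / fu,inc."""
    if not 0.0 < fu_incubation <= 1.0:
        raise ValueError("fu_incubation must be in the interval (0, 1]")
    return cl_int_apparent / fu_incubation


def well_stirred_hepatic_clearance(hepatic_blood_flow, fu_plasma, cl_int_unbound):
    """Well-stirred model (an assumed choice of liver model):
    CLh = Qh * fu,p * CLint,u / (Qh + fu,p * CLint,u)."""
    if not 0.0 < fu_plasma <= 1.0:
        raise ValueError("fu_plasma must be in the interval (0, 1]")
    unbound_capacity = fu_plasma * cl_int_unbound
    return hepatic_blood_flow * unbound_capacity / (hepatic_blood_flow + unbound_capacity)


# Hypothetical values, all clearances and flows in the same units (e.g. L/h).
example_cl_int_u = unbound_intrinsic_clearance(cl_int_apparent=50.0, fu_incubation=0.5)
example_cl_h = well_stirred_hepatic_clearance(
    hepatic_blood_flow=90.0, fu_plasma=0.1, cl_int_unbound=example_cl_int_u
)
```

With these made-up inputs, ignoring incubation binding (treating fu in the incubation as 1) would halve the unbound intrinsic clearance and lower the predicted hepatic clearance, which is the kind of underprediction that reported fu values help avoid.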
Title: Data set of fraction unbound values in the in vitro incubations for metabolic studies for better prediction of human clearance.
Journal: Database: The Journal of Biological Databases and Curation, volume 2024. Publication date: 2024-07-24.
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11269425/pdf/
Pub Date : 2024-07-23 DOI: 10.1093/database/baae065
Ofer Isakov, Dina Marek-Yagel, Rotem Greenberg, Michal Naftali, Shay Ben-Shachar
Targeted gene panel sequencing is used to limit the search for causative genetic variants solely to genes with an established association with the phenotype. The design of gene panels is challenging due to the lack of consensus regarding phenotypic associations for some genes, which results in high variation in gene composition for the same panel offered by different laboratories. We developed PANGEN, a platform that provides a centralized resource for gene panel information, with the ability to compare and generate new intelligent diagnostic panels. Gene-phenotype associations were collected from 12 public and commercial sources (Blueprint, Cegat, Centogene, ClinGen, Fulgent, GeneDx, Health in Code, Human Phenotype Ontology, Invitae, PanelApp, Prevention genetics, and Pronto diagnostics). Gene-phenotype associations are categorized into tiers according to categories derived from the original source panel. Pairwise panel similarity was calculated by dividing the number of common genes by the total number of genes in both panels. Regions with extreme guanine-cytosine (GC) content were collected from the Genome in a Bottle stratifications dataset, and putative genomic duplications were retrieved from the University of California, Santa Cruz (UCSC) database. Overall, 1533 panels, 9759 phenotypes, and 6979 genes were collected. The platform provides an interface to (i) explore and compare collected panels, (ii) find similar panels, (iii) identify genes with high GC content or duplication levels, (iv) generate gene panels by combining panels from various sources, and (v) stratify a generated panel into genes with a strong phenotype association ('core') and those with a weaker association ('extended'). The presented platform represents a unique resource for gene panel exploration and comparison that facilitates the generation of tailored diagnostic panels through a public online web server. Database URL: https://c-gc.shinyapps.io/PANGEN/.
Title: PANGEN: an online platform for the comparison and creation of diagnostic gene panels.
Journal: Database: The Journal of Biological Databases and Curation, volume 2024. Publication date: 2024-07-23.
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11265858/pdf/
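The pairwise panel similarity mentioned in the PANGEN abstract (common genes divided by the total number of genes in both panels) can be sketched as follows. This is a minimal illustration, not the platform's actual code: it assumes "total number of genes in both panels" means the number of distinct genes across the two panels, which makes the measure a Jaccard index. The function name `panel_similarity` and the example panels are hypothetical.

```python
from typing import Iterable


def panel_similarity(panel_a: Iterable[str], panel_b: Iterable[str]) -> float:
    """Shared genes divided by all distinct genes across both panels.

    Assumption: the denominator is the union of the two gene sets
    (Jaccard index); the abstract does not spell this out.
    """
    genes_a, genes_b = set(panel_a), set(panel_b)
    all_genes = genes_a | genes_b
    if not all_genes:
        # Two empty panels: similarity is defined as 0 here (assumption).
        return 0.0
    return len(genes_a & genes_b) / len(all_genes)


# Hypothetical panels from two laboratories, for illustration only.
lab_1_panel = {"MYH7", "MYBPC3", "TNNT2", "TPM1"}
lab_2_panel = {"MYH7", "MYBPC3", "LMNA"}

# 2 shared genes out of 5 distinct genes.
print(panel_similarity(lab_1_panel, lab_2_panel))
```

If the denominator were instead the sum of the two panel sizes, identical panels would score 0.5 rather than 1, so the union reading is the more natural one for a similarity score.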
Pub Date : 2024-07-23 DOI: 10.1093/database/baae070
Sathishkumar Samiappan, B Santhana Krishnan, Damion Dehart, Landon R Jones, Jared A Elmore, Kristine O Evans, Raymond B Iglay
Drones (unoccupied aircraft systems) have become effective tools for wildlife monitoring and conservation. Automated animal detection and classification using artificial intelligence (AI) can substantially reduce logistical and financial costs and improve drone surveys. However, the lack of annotated animal imagery for training AI is a critical bottleneck in achieving accurate performance of AI algorithms compared to other fields. To bridge this gap for drone imagery and help advance and standardize automated animal classification, we have created the Aerial Wildlife Image Repository (AWIR), which is a dynamic, interactive database with annotated images captured from drone platforms using visible and thermal cameras. The AWIR provides the first open-access repository for users to upload, annotate, and curate images of animals acquired from drones. The AWIR also provides annotated imagery and benchmark datasets that users can download to train AI algorithms to automatically detect and classify animals, and compare algorithm performance. The AWIR contains 6587 animal objects in 1325 visible and thermal drone images of predominantly large birds and mammals of 13 species in open areas of North America. As contributors increase the taxonomic and geographic diversity of available images, the AWIR will open future avenues for AI research to improve animal surveys using drones for conservation applications. Database URL: https://projectportal.gri.msstate.edu/awir/.
Title: Aerial Wildlife Image Repository for animal monitoring with drones in the age of artificial intelligence.
Journal: Database: The Journal of Biological Databases and Curation, volume 2024. Publication date: 2024-07-23.
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11265857/pdf/
Postoperative pulmonary complications (PPCs) are highly heterogeneous disorders with diverse risk factors that frequently occur after surgical interventions, resulting in significant financial burdens, prolonged hospitalization and elevated mortality rates. Despite the existence of multiple studies on PPCs, a comprehensive knowledge base that can effectively integrate and visualize the diverse risk factors associated with PPCs is currently lacking. This study aims to develop an online knowledge platform on risk factors for PPCs (Postoperative Pulmonary Complications Risk Factor Knowledge Base, PPCRKB) that categorizes and presents the risk and protective factors associated with PPCs, and to facilitate the development of individualized prevention and management strategies for PPCs based on the needs of each investigator. The PPCRKB is a novel knowledge base that encompasses all investigated potential risk factors linked to PPCs, offering users a web-based platform to access these risk factors. The PPCRKB contains 2673 entries and 915 risk factors, which have been categorized into 11 distinct groups. These categories include habit and behavior, surgical factors, anesthetic factors, auxiliary examination, environmental factors, clinical status, medicines and treatment, demographic characteristics, psychosocial factors, genetic factors and miscellaneous factors. The PPCRKB holds significant value for PPC research. The inclusion of both quantitative and qualitative data in the PPCRKB enhances the ability to uncover new insights and solutions related to PPCs. It could provide clinicians with a more comprehensive perspective on research related to PPCs in future. Database URL: http://sysbio.org.cn/PPCs.
Title: PPCRKB: a risk factor knowledge base of postoperative pulmonary complications.
Authors: Jianchao Duan, Peiyi Li, Aibin Shao, Xuechao Hao, Ruihao Zhou, Cheng Bi, Xingyun Liu, Weimin Li, Huadong Zhu, Guo Chen, Bairong Shen, Tao Zhu
Pub Date : 2024-07-19 DOI: 10.1093/database/baae054
Journal: Database: The Journal of Biological Databases and Curation, volume 2024.
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11259045/pdf/