Pub Date: 2025-02-15; DOI: 10.1038/s41597-025-04537-4
Anna Sofia Lippolis, Giorgia Lodi, Andrea Giovanni Nuzzolese
Global sustainability challenges have recently led to increasing interest in the management of water and health resources. The availability of effective, meaningful and open data is therefore crucial to addressing those issues in the broader context of the Sustainable Development Goals of clean water and sanitation as targeted by the United Nations. In this paper, we present the Water Health Open Knowledge Graph (WHOW-KG) along with its design methodology and an analysis of its impact. Developed in the context of the EU-funded WHOW (Water Health Open Knowledge) project, the WHOW-KG is a semantic knowledge graph that models data on water consumption, pollution, extreme weather events, infectious disease rates and drug distribution. It aims to support a wide range of applications, from knowledge discovery to decision-making, making it a valuable resource for researchers, policymakers, and practitioners in the water and health domains. The WHOW-KG consists of a network of five ontologies and related linked open data modelled according to those ontologies. As a fully distributed system, it is sustainable over time, can handle large datasets, and allows data providers full control, establishing it as a vital European asset in the fields of water consumption and pollution.
Title: The Water Health Open Knowledge Graph. Scientific Data 12(1), 274 (2025).
The rapid development of perovskite solar devices has led to a rising number of publications over the past decade. As a result, a project aiming to compile all published device data was initiated in 2022. However, because the project relies on manual data collection, one of its hurdles is encouraging the perovskite community to spend time and effort entering new device data. Adequate participation is necessary for the project's sustainability but is challenging to achieve. In response, we propose using natural language processing algorithms to extract various attributes of perovskite solar devices from journal articles. When data collection is performed by programs instead of humans, the lack of community participation can be overcome. For each device, the identifying device information, intrinsic device data, extrinsic cell definition, and the details of the fabrication procedure were extracted. A total of 30 attributes from 3164 journal articles were compiled, with an average accuracy of 0.899. The dataset and source code are made publicly available.
Title: Auto-generating a database on the fabrication details of perovskite solar devices. Authors: Agnes Valencia, Fei Liu, Xiangyang Zhang, Xiangkun Bo, Weilu Li, Walid A Daoud. Pub Date: 2025-02-14; DOI: 10.1038/s41597-025-04566-z. Scientific Data 12(1), 270 (2025).
Pub Date: 2025-02-14; DOI: 10.1038/s41597-025-04532-9
Chenhui Shen, Guofeng Yang, Min Tang, Xiaofei Li, Li Zhu, Wei Li, Lin Jin, Pan Deng, Huanhuan Zhang, Qing Zhai, Gang Wu, Xiaohong Yan
Mylabris sibirica is a hypermetamorphic insect that primarily feeds on oilseed rape during the adult stage. However, the limited availability of genomic resources hinders our understanding of gene function, medical use, and ecological adaptation in M. sibirica. Here, a high-quality chromosome-level genome of M. sibirica was generated using PacBio, Illumina, and Hi-C technologies. The genome size was 138.45 Mb, with a scaffold N50 of 13.84 Mb, and 99.85% (138.25 Mb) of the assembly was anchored onto 10 pseudo-chromosomes. BUSCO analysis showed this genome assembly had a high-level completeness of 100% (n = 1,367), containing 1,358 (99.4%) single-copy BUSCOs and 8 (0.6%) duplicated BUSCOs. In addition, a total of 11,687 protein-coding genes and 35.46% (49.10 Mb) repetitive elements were identified. The high-quality genome assembly offers valuable genomic resources for exploring gene function, medical use, and ecology.
Title: A chromosome-level genome assembly of Mylabris sibirica Fischer von Waldheim, 1823 (Coleoptera, Meloidae). Scientific Data 12(1), 269 (2025).
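The scaffold N50 quoted in the abstract above is a standard contiguity statistic: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. A minimal sketch of that definition follows; the function name `n50` and the toy lengths are illustrative assumptions, not values or code from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sort lengths from longest to shortest and accumulate them; the N50 is
    the length at which the running sum first reaches half of the total.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if running * 2 >= total:
            return length


# Hypothetical scaffold lengths (arbitrary units), not data from the paper.
toy_scaffolds = [50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

With real data, the lengths would come from the assembly FASTA index rather than a hand-written list.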
Pub Date: 2025-02-14; DOI: 10.1038/s41597-025-04522-x
Mathilde Resell, Hanne-Line Rabben, Animesh Sharma, Lars Hagen, Linh Hoang, Nan T Skogaker, Anne Aarvik, Eirik Knudsen Bjåstad, Magnus K Svensson, Manoj Amrutkar, Caroline S Verbeke, Surinder K Batra, Gunnar Qvigstad, Timothy C Wang, Anil Rustgi, Duan Chen, Chun-Mei Zhao
Pancreatic ductal adenocarcinoma (PDAC) remains one of the most lethal malignancies, with a five-year survival rate of 10-15% due to late-stage diagnosis and limited efficacy of existing treatments. This study utilized proteomics-based systems modelling to generate multimodal datasets from various research models, including PDAC cells, spheroids, organoids, and tissues derived from murine and human samples. Identical mass spectrometry-based proteomics was applied across the different models. The preparation and validation of the research models and the proteomics are described in detail. The assembly datasets we present here contribute to the data collection on PDAC, which will be useful for systems modelling, data mining, knowledge discovery in databases, and bioinformatics of individual models. Further data analysis may lead to the generation of research hypotheses, predictions of targets for diagnosis and treatment, and relationships between data variables.
Title: Proteomics profiling of research models for studying pancreatic ductal adenocarcinoma. Scientific Data 12(1), 266 (2025).
Pub Date: 2025-02-14; DOI: 10.1038/s41597-025-04528-5
Alice Tian, Sangbae Kim, Hasna Baidouri, Jin Li, Xuesen Cheng, Janice Vranka, Yumei Li, Rui Chen, VijayKrishna Raghunathan
The trabecular meshwork within the outflow apparatus is critical in maintaining intraocular pressure homeostasis. In vitro studies employing primary cell cultures of the human trabecular meshwork (hTM) have conventionally served as surrogates for investigating the pathobiology of TM dysfunction. Despite their abundant use, translation of outcomes from in vitro studies to ex vivo and/or in vivo studies remains a challenge. Given the cell heterogeneity, single-cell RNA sequencing that compares primary hTM cell cultures to hTM tissue may provide important insights into cellular identity and translatability, as such an approach has not been reported before. In this study, we assembled a total of 14 primary hTM in vitro samples across passages 1-4, including 4 samples from individuals diagnosed with glaucoma. This dataset offers a comprehensive transcriptomic resource of primary hTM in vitro scRNA-seq data to study global changes in gene expression in comparison to cells in tissue in situ. We have performed extensive preprocessing and quality control, allowing the research community to access and utilize this public resource.
Title: Divergence in cellular markers observed in single-cell transcriptomics datasets between cultured primary trabecular meshwork cells and tissues. Scientific Data 12(1), 264 (2025).
Pub Date: 2025-02-14; DOI: 10.1038/s41597-024-04135-w
Franz Pablo Antezana Lopez, Guanhua Zhou, Guifei Jing, Kai Zhang, Liangfu Chen, Lin Chen, Yumin Tan
Accurate global carbon dioxide (CO2) distribution with high spatial and temporal resolution is essential for understanding its dynamics and impacts on climate change. This study tackles the challenge of data gaps in satellite observations of greenhouse gases, caused by orbital and observational limitations. We reconstructed a comprehensive dataset of column-averaged CO2 (XCO2) concentrations by integrating re-analyzed data from the Copernicus Atmosphere Monitoring Service (CAMS) with observations from the GOSAT and OCO-3 satellites. Using two advanced data reconstruction methods, Data Interpolating Empirical Orthogonal Functions (DINEOF) and Data Interpolating Convolutional Auto-Encoder (DINCAE), we imputed missing data while preserving spatial and temporal consistency. The combined approach achieved high accuracy, with Pearson correlation values between 0.94 and 0.95 against TCCON measurements, and we also reported root mean square error (RMSE) to assess model performance further. Our results indicate that these techniques generate a daily, high-resolution, gap-free XCO2 dataset, enabling improved CO2 monitoring, climate modeling, and policy development.
Title: Global Daily Column Average CO2 at 0.1° × 0.1° Spatial Resolution Integrating OCO-3, GOSAT, CAMS with EOF and Deep Learning. Scientific Data 12(1), 268 (2025).
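The abstract above validates the reconstructed XCO2 field against TCCON measurements using the Pearson correlation and the RMSE. As a rough illustration of what those two metrics compute, the sketch below implements both in plain Python. The function names and the sample values are hypothetical and are not taken from the dataset or the authors' code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def rmse(predicted, observed):
    """Root mean square error between predictions and observations."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical XCO2 values in ppm, for illustration only.
reconstructed = [410.2, 411.0, 412.5, 413.1]
ground_based = [410.0, 411.4, 412.1, 413.6]
print(pearson_r(reconstructed, ground_based), rmse(reconstructed, ground_based))
```

The two metrics are complementary: correlation captures whether the reconstruction tracks the variability of the reference, while RMSE captures the typical size of the disagreement in the original units.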
Pub Date: 2025-02-14; DOI: 10.1038/s41597-024-04259-z
Mustafa Arikan, James Willoughby, Sevim Ongun, Ferenc Sallo, Andrea Montesel, Hend Ahmed, Ahmed Hagag, Marius Book, Henrik Faatz, Maria Vittoria Cicinelli, Amani A Fawzi, Dominika Podkowinski, Marketa Cilkova, Diana Morais De Almeida, Moussa Zouache, Ganesham Ramsamy, Watjana Lilaonitkul, Adam M Dubis
Publicly available open-access OCT datasets for retinal layer segmentation have been limited in scope, often being small in size, specific to a single disease, or containing only one grading. This dataset improves upon this with multi-grader and multi-disease labels for training machine learning-based algorithms. The proposed dataset covers three subsets of scans (Age-related Macular Degeneration, Diabetic Macular Edema, and healthy) and annotations for two types of tasks (semantic segmentation and object detection). The dataset comprises 5016 pixel-wise manual labels for 1672 OCT scans featuring 5 layer boundaries for three different disease classes to support development of automatic techniques. A subset of the data (566 scans across 9 classes of disease biomarkers) was subsequently labeled for disease features, yielding 4698 bounding box annotations. To minimize bias, images were shuffled and distributed among graders. Retinal layers were corrected, and outliers were identified using the interquartile range (IQR). This step was repeated three times, progressively improving the quality of the layer annotations and ensuring a reliable dataset for automated retinal image analysis.
Title: OCT5k: A dataset of multi-disease and multi-graded annotations for retinal layers. Scientific Data 12(1), 267 (2025).
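The abstract above mentions identifying outlying annotations with the interquartile range (IQR) but does not state the exact rule. The sketch below shows one common variant, Tukey's fences with a 1.5 × IQR margin. The multiplier, the function name `iqr_outliers`, and the sample values are assumptions made for illustration and may differ from the authors' procedure.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the indices of values lying outside Tukey's fences.

    The fences are Q1 - k * IQR and Q3 + k * IQR, where IQR = Q3 - Q1.
    k = 1.5 is a conventional default, assumed here rather than sourced.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to compute quartiles")
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical per-scan layer thickness errors (pixels), not real data.
errors = [10, 11, 12, 11, 10, 12, 11, 40]
print(iqr_outliers(errors))
```

In an iterative workflow like the one described, flagged items would be sent back for correction and the check rerun on the updated annotations.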
Pub Date: 2025-02-14; DOI: 10.1038/s41597-025-04589-6
Martin J O'Connor, Josef Hardi, Marcos Martínez-Romero, Sowmya Somasundaram, Brendan Honick, Stephen A Fisher, Ajay Pillai, Mark A Musen
Scientists increasingly recognize the importance of providing rich, standards-adherent metadata to describe their experimental results. Although sophisticated tools are available to assist in the process of data annotation, investigators generally seem to prefer spreadsheets when supplying metadata, even though spreadsheets are limited in ensuring metadata consistency and compliance with formal specifications. In this paper, we describe an end-to-end approach that supports spreadsheet-based entry of metadata, while ensuring rigorous adherence to community-based metadata standards and providing quality control. Our methods employ several key components: customizable templates that represent metadata standards and that can inform the spreadsheets that investigators use to author metadata; controlled terminologies and ontologies for defining metadata values that can be accessed directly from a spreadsheet; and an interactive Web-based tool that allows users to rapidly identify and fix errors in their spreadsheet-based metadata. We demonstrate how this approach is being deployed in a biomedical consortium known as HuBMAP to define and collect metadata about a wide range of biological assays.
Title: Ensuring Adherence to Standards in Experiment-Related Metadata Entered Via Spreadsheets. Scientific Data 12(1), 265 (2025).
Pub Date: 2025-02-13; DOI: 10.1038/s41597-024-04352-3
Luz F Jiménez-Segura, Daniel Restrepo-Santamaria, Juan G Ospina-Pabón, María C Castellanos-Mejía, Daniel Valencia-Rodríguez, Andrés F Galeano-Moreno, José L Londoño-López, Juliana Herrera-Pérez, Víctor M Medina-Ríos, Jonathan Álvarez-Bustamante, Manuela Mejía-Estrada, Marcela Hernández-Zapata, Luis J García-Melo, Omer Campo-Nieto, Iván D Soto-Calderón, Carlos DoNascimiento
Progress in the acquisition of massive sets of molecular data and in the bioinformatic capabilities for their processing has revolutionised species identification, filling gaps in crucial areas such as taxonomy, phylogenetic inference, biogeography, and even biodiversity conservation. Advanced DNA sequencing and metabarcoding have uncovered previously hidden diversity, although their effectiveness is highly dependent on the accuracy of reference DNA databases at local and regional scales. The compilation of information on freshwater fishes from the Magdalena River basin is an important milestone in improving our knowledge of the genetic and taxonomic diversity of a highly endemic region in the Neotropical context. Here, we share DNA data from 1,270 specimens representing 183 species, cross-referenced with complete collecting and catalogue information, along with high-resolution photographs of voucher specimens when alive. This collection of multiple sources of information based on fish specimen records contributes not only to future research on Neotropical fish systematics and ecology, but also to conservation decisions in one of the South American rivers with the highest levels of endemism.
Title: Fish databases for improving their conservation in Colombia. Scientific Data 12(1), 262 (2025). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11825713/pdf/
Pub Date: 2025-02-13; DOI: 10.1038/s41597-025-04564-1
Vijai Bhadauria, Guangjun Li, Xinying Gao, Pedro Laborda
Maize leaf and sheath spot disease caused by Epicoccum sorghinum is an emerging disease of maize in China. To disentangle the molecular pathogenesis, we sequenced the genome and infection transcriptomes of the E. sorghinum strain NJC07. The genome was sequenced on Oxford Nanopore GridION and Illumina NovaSeq 6000, producing a near-complete gapless nuclear genome assembly of 32.69 Mb at 285.20-fold depth, comprising 23 contigs (including 12 full-length chromosomes) with an N50 contig number/length of 6/1.66 Mb, and a complete mitochondrial genome assembly of 61.24 kb. The nuclear genome contains 11,779 protein-coding genes, including those predicted to encode potential virulence/pathogenicity factors, such as effectors and carbohydrate-active enzymes. Temporal RNA-Seq analysis revealed that 4,058 of the 11,779 genes were induced during maize infection, with a subset potentially implicated in fungal invasion and colonization of maize plants. Together, the genomic and transcriptomic data generated in the study provide a valuable foundation for the functional analysis of virulence and pathogenicity factors, offering critical insights into the molecular mechanisms driving E. sorghinum pathogenesis on maize.
Title: Near-complete genome and infection transcriptomes of the maize leaf and sheath spot pathogen Epicoccum sorghinum. Scientific Data 12(1), 261 (2025). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11825725/pdf/