Pub Date: 2026-03-03 DOI: 10.1038/s41597-026-06743-0
Ian W Housman, Sean P Healey, Joshua Heyer, Elizabeth Hardwick, Zhiqiang Yang, Jennifer Ross, Kevin Megown
Maps of land cover class are more common, and generally more accurate, than maps of land use because "use" implies management intent that may not be directly sensible by earth-observing satellites. However, many monitoring frameworks related to sustainability require land use and land cover to be explicitly differentiated. This is particularly true for forests, where natural and human-caused dynamics in tree cover often occur independently of long-term land use changes that signal deforestation. We used an extensive multi-temporal, multi-variate sample of reference points across the United States to calibrate and validate 30 m mapped time series (1985-present) of land cover, land use, and vegetation condition change. These maps comprise the Landscape Change Monitoring System (LCMS) and are served through: an interactive, open-access app; Google Earth Engine; image services; and the FSGeodata Clearinghouse. Here, we provide methods, validation metrics, and a usage example highlighting the value of differentiating use from cover in the context of model-assisted estimation of forest area using U.S. Department of Agriculture, Forest Service inventory data.
Title: Coincident maps of changing land cover, land use, and forest condition in the United States, 1985-present. Journal: Scientific Data.
Pub Date: 2026-03-02 DOI: 10.1038/s41597-026-06648-y
Nitya Mittal, Sebastian Vollmer
The data collected for this study focus on two research questions. First, the study examines the effectiveness of a portable saving device in reducing temptation spending and increasing savings using a Randomised Control Trial (RCT) design. We then build on the data collected for the RCT among slum dwellers in Pune, India, and expand the scope of data collection to examine the long-term effect of the COVID-19 pandemic on livelihoods and consumption expenditure. Detailed information on income, savings, expenditure, and knowledge about and behaviour during the pandemic was collected across the rounds. Additional information on female empowerment, decision making within the household, and behavioural parameters was also collected. Four rounds of data were collected: two rounds before COVID-19, in 2018 and 2019, through field interviews, and two rounds in 2020 and 2022 through phone interviews. The baseline sample consisted of 1525 slum dwellers in Pune who earned above subsistence-level income, and the balanced panel comprises 411 individuals.
Title: Savings behaviour and livelihoods before and after COVID-19 - a four round panel dataset from Pune, India. Journal: Scientific Data. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12957331/pdf/
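The step from a baseline sample of 1525 to a balanced panel of 411 can be illustrated with a minimal sketch. It assumes the usual meaning of "balanced panel", namely respondents observed in every one of the four rounds; the abstract does not spell out the exact inclusion rule. The function name, the record layout, and the toy records are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def balanced_panel(observations, rounds):
    """Return the sorted IDs of respondents observed in every round.

    observations: iterable of (respondent_id, round_label) pairs.
    rounds: the round labels that define the full panel.
    """
    seen = defaultdict(set)
    for respondent_id, round_label in observations:
        seen[respondent_id].add(round_label)
    required = set(rounds)
    # Keep only respondents whose observed rounds cover all required rounds.
    return sorted(rid for rid, observed in seen.items() if required <= observed)


# Survey years named in the abstract.
ROUNDS = [2018, 2019, 2020, 2022]

# Hypothetical toy records, NOT real data from the Pune dataset.
toy_observations = [
    ("A", 2018), ("A", 2019), ("A", 2020), ("A", 2022),
    ("B", 2018), ("B", 2019),                      # lost after the field rounds
    ("C", 2018), ("C", 2019), ("C", 2020), ("C", 2022),
    ("D", 2018), ("D", 2020), ("D", 2022),         # missed the 2019 round
]

panel_ids = balanced_panel(toy_observations, ROUNDS)
print(panel_ids)  # ['A', 'C']
```

Under this definition, attrition in any single round removes a respondent from the balanced panel, which is one plausible reason the panel is much smaller than the baseline sample.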
Pub Date: 2026-03-02 DOI: 10.1038/s41597-026-06552-5
Sujit Roy, Dinesha V Hegde, Johannes Schmude, Rohit Lal, Vishal Gaur, Amy Lin, Kshitiz Mandal, Talwinder Singh, Andrés Muñoz-Jaramillo, Kang Yang, Chetraj Pandey, Jinsu Hong, Berkay Aydin, Ryan McGranaghan, Spiridon Kasapis, Vishal Upendran, Shah Bahauddin, Daniel da Silva, Marcus Freitag, Iksha Gurung, Nikolai Pogorelov, Campbell Watson, Manil Maskey, Juan Bernabe-Moreno, Rahul Ramachandran
This paper introduces a high-resolution, machine learning-ready heliophysics dataset derived from NASA's Solar Dynamics Observatory (SDO), specifically designed to advance machine learning (ML) applications in solar physics and space weather forecasting. The dataset includes processed imagery from the Atmospheric Imaging Assembly (AIA) and Helioseismic and Magnetic Imager (HMI), spanning a solar cycle from May 2010 to December 2024. To ensure suitability for ML tasks, the data have been preprocessed; the steps include correction of spacecraft roll angles, orbital adjustments, exposure normalization, and degradation compensation. We also provide auxiliary benchmark datasets that complement the core SDO dataset. These cover central heliophysics and space weather tasks such as active region segmentation, active region emergence forecasting, coronal field extrapolation, solar flare prediction, solar Extreme Ultraviolet (EUV) spectra prediction, and solar wind speed estimation. By establishing a unified, standardized data collection, this dataset aims to facilitate benchmarking, enhance reproducibility, and accelerate the development of AI-driven models for critical space weather prediction tasks, bridging gaps between solar physics, machine learning, and operational forecasting.
Title: SuryaBench: Benchmark Dataset for Advancing Machine Learning in Heliophysics and Space Weather Prediction. Journal: Scientific Data.
Pub Date: 2026-03-02 DOI: 10.1038/s41597-026-06945-6
Liangqiu Chen, Shiyong Li, Lei Liu, Guanfeng Huang, Yan Zhuang, Zhencheng Chen, Yongbo Liang, Mohamed Elgendi
Hemoglobin (Hb) concentration is a fundamental physiological marker widely used in the diagnosis of anemia and the assessment of cardiovascular health. Although invasive blood testing provides high accuracy, its reliance on laboratory infrastructure limits scalability and real-time applicability. Here, we present Hb-PPG, a four-wavelength photoplethysmography (PPG) dataset designed to support research on non-invasive hemoglobin assessment and cardiovascular monitoring. The dataset comprises 1008 PPG signal segments acquired at 660, 730, 850, and 940 nm from 252 adult subjects, alongside reference measurements of hemoglobin, fasting blood glucose, and brachial artery systolic and diastolic blood pressure. Hb-PPG enables systematic investigation of wavelength-dependent PPG signal characteristics and their relationships with hematological and hemodynamic parameters. By providing high-quality, multi-wavelength optical signals with clinically grounded reference data, this dataset facilitates the development, validation, and benchmarking of non-invasive approaches for hemoglobin estimation and related vascular health applications. The dataset is intended to support algorithm development, benchmarking, and methodological studies in non-invasive hemoglobin estimation, rather than direct clinical diagnosis.
Title: A Four-Wavelength Photoplethysmography dataset for non-invasive hemoglobin assessment. Journal: Scientific Data.
Pub Date: 2026-03-02 DOI: 10.1038/s41597-026-06746-x
Zhuo Cheng, Yuanyuan Xing, Yiming Pan, Jue Wang, Xinxin Wu, Jiahua Li, Congli Xu, Ren-Ai Xu, Fangfang Xia, Zhong Liu, Chunlin Long
Craigia yunnanensis, endemic to East Asia, is an endangered species of considerable economic and scientific research value. However, the absence of a reference genome has hindered studies on genetic variation and conservation management of C. yunnanensis. To address this gap, we present a high-quality chromosome-level genome sequence of C. yunnanensis generated using PacBio HiFi sequencing and Hi-C scaffolding. The genome has a total length of 1,618.96 Mb with a scaffold N50 of 39.39 Mb, and 98.00% of the genome is assigned to 41 chromosomes. BUSCO assessment yielded a completeness score of 99.40%. Furthermore, we predicted 58,969 protein-coding genes, of which 94.09% were functionally annotated. The assembly of the C. yunnanensis genome facilitates a deeper understanding of adaptive evolution in Craigia, knowledge that is fundamental to promoting the conservation and enabling evidence-based management of this endangered plant.
Title: A chromosome-level reference genome of an endangered plant Craigia yunnanensis. Journal: Scientific Data.
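For readers unfamiliar with the scaffold N50 statistic quoted above, the following is a minimal sketch of the standard definition: sort sequence lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name and the toy lengths are hypothetical; this is not the authors' pipeline, and real assemblies would be read from a FASTA index rather than a hard-coded list.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical scaffold lengths (arbitrary units), NOT from the C. yunnanensis assembly.
toy_scaffolds = [50, 40, 30, 20, 10]

# Total is 150, half is 75; 50 + 40 = 90 is the first running sum >= 75.
print(n50(toy_scaffolds))  # 40
```

A larger N50 relative to the total assembly length indicates that more of the genome sits in long, contiguous sequences.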
Pub Date: 2026-03-02 DOI: 10.1038/s41597-026-06741-2
Chelsea A Southworth, Jack C Winans, Jacob B Gordon, Niki H Learn, William A Wilber, Catherine Andreadis, Gretchen Andreasen, Mimi Arandjelovic, C Ryan Campbell, Mary N Chege, Maria J A Creighton, Carmen M Cromer, Reena Debray, Carly C Dickson, Pamela Ferretti, Elizabeth M George, Laurence R Gesquiere, Shuyu He, Leif Hey, Emily E Jefferson, Ipek G Kulahci, Brian A Lerch, Lee Nonnamaker, Iker Rivas-González, Beniamino Tuliozi, Shasta E Webb, Susan C Alberts, Elizabeth A Archie, Jenny Tung
Long-term data sets on individually recognized animals and their environments are critical to understanding animal behavior, evolution, and ecology. However, they are resource- and time-intensive and seldom made publicly available. The Amboseli Baboon Research Project (ABRP) is one of the longest-running studies of a wild mammal population in the world and has collected extensive data on the baboon population of the Amboseli ecosystem in Kenya since 1971. Here, we describe four ABRP data sets newly available to the evolutionary biology, behavioral ecology, and primatology communities: (1) the sizes and demographic compositions of 21 social groups from 1971-2023; (2) the activity budgets of adult females and immatures from 1984-2023; (3) behavioral data on diet for adult females and immatures from 1984-2023; and (4) weather data, including precipitation from 1976-2023 and temperature from 1976-2022. Data are aggregated annually and monthly to enable cross-data set analyses. These data offer a rare longitudinal perspective on behavioral and ecological change in a wild mammal population.
Title: Demographic, behavioral, and ecological data from a long-term field study of wild baboons in Amboseli, Kenya. Journal: Scientific Data 13(1). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12953632/pdf/
Pub Date: 2026-03-02 DOI: 10.1038/s41597-026-06922-z
Qunnan Qiu, Zhe Liu, Yuqing Huang, Huilin Pang, Liuyang Li, Min Zhu, Xiaolong Hu, Chengliang Gong
DNA methylation, like histone modification, is an important regulatory mechanism for altering gene expression. Our previous study showed that Bombyx mori cytoplasmic polyhedrosis virus (BmCPV) infection can change the levels of trimethylation of lysine 9 of histone 3 (H3K9me3) and acetylation of lysine 9 of histone 3 (H3K9ac), thereby regulating mRNA expression in the midgut of the silkworm, B. mori. However, the correlation between the genome-scale DNA methylome and the transcriptome remains underexplored. In this study, whole-genome bisulfite sequencing (WGBS) was performed on midguts of BmCPV-infected silkworms at 48 h and 96 h post infection and on the corresponding midguts of uninfected silkworms. This analysis will contribute to further understanding of how BmCPV regulates gene expression through epigenetic modification at the genome-wide level.
Title: Genome-scale DNA methylome and transcriptome profiling of midgut of Bombyx mori infected with BmCPV. Journal: Scientific Data.
Pub Date: 2026-03-02 DOI: 10.1038/s41597-026-06827-x
Joseph Giovanelli, Matteo Magnini, Giovanni Ciatto, Angel S Marrero, Andrea Borghesi, Gustavo A Marrero, Roberta Calegari
This paper introduces a novel benchmark dataset designed to support fairness-oriented research in artificial intelligence within the educational domain. The dataset originates from longitudinal survey data collected by the Agencia Canaria de Calidad Universitaria y Evaluación Educativa, encompassing comprehensive information from students, families, and teachers across the Canary Islands, Spain. It includes detailed student profiles and academic trajectories, covering multiple years of academic performance outcomes. The original data is characterised by a high-dimensional and sparse feature space, which presents challenges for direct application in AI workflows. To address these challenges while minimising the risk of introducing bias during preprocessing, we provide a curated version of the dataset specifically tailored for AI applications. This version preserves the statistical properties of the original data and is accompanied by detailed documentation of the preprocessing steps, including strategies for dimensionality reduction and fairness preservation. The dataset is intended as a resource for the research community, enabling studies on fairness, predictive modeling, and educational analytics. We describe its structure, content, and preparation process.
Title: Unfair Inequality in Education: A Benchmark for AI-Fairness Research. Journal: Scientific Data.
Pub Date: 2026-03-02 DOI: 10.1038/s41597-026-06908-x
Erwan Le Floch, Anne-Françoise Adam-Blondon, Michael Alaux, Etienne Bardet, Noor Bas, Filippo M Bassi, Maja Boczkowska, Paulina Bolc, Matthijs Brouwer, Boulos Chalhoub, Reinhoud De Blok, Gergana Desheva, Jagadeeshwar R Etukala, Raphaël Flores, Indira Galit, Wouter Groenink, Rene Hauptvogel, Roel Hoekstra, Zakaria Kehel, Paul Kersey, Renata Kowalik, Suman Kumar, Bozhidar Kyosev, Matthias Lange, Cătălin Lazăr, Cristina Marinciu, Diana Martín-Lammerding, Adrian Motor, Mounika Pachipala, Mercedes Pallero-Baena, Eugen Petcu, Aleksandra Pietrusińska-Radzio, Wiesław Podyma, Cyril Pommier, Marta Puchta-Jasińska, Szymon Puła, Laura Reiniers, Joseph Ruff, Magdalena Ruiz, Francesca Sansoni, Beate Schierscher, Gabriela Șerban, Sarah Serex, Patrizia Vaccino, Robbert Van Treuren, Mandea Vasile, Liliana Vasilescu, Andrea Visioni, Stephan Weise, Erik Wijnker, Meryem Zaim, Jochen C Reif, Marcel O Berkner
Plant genetic resources are considered a treasure trove of valuable, untapped diversity that holds the key to breeding the crops of the future. However, the use of these resources in breeding is often limited by the lack of comprehensive phenotypic characterization. The present study provides extensive historical phenotypic data from nine genebanks as a MIAPPE-compliant data set. We compiled and curated phenotypic data from 43,293 wheat accessions, encompassing 460,399 data points across 52 traits and spanning seven decades, including the three core traits of plant height, heading time, and thousand kernel weight. The quality of the dataset is reflected in predominantly high heritabilities. Phenotypic data of such quantity and quality are a crucial resource for unlocking the valuable diversity of plant genetic resources for agricultural advancement.
Title: Wheat historical phenotypic data from European genebanks as an important resource for research and breeding. Journal: Scientific Data.
Pub Date : 2026-03-02DOI: 10.1038/s41597-026-06554-3
Guiliang Xin, Gang Wang, Bobin Liu, Daizhen Zhang, Boping Tang, Chuanyuan Deng, Lie Wang
Bischofia polycarpa (2n = 68), a member of the Phyllanthaceae family, is a native deciduous tree with a natural distribution ranging from the southern Qinling Mountains and the Huaihe River basin to the northern regions of Fujian and Guangdong, China. It holds significant horticultural, ornamental, and medicinal value and serves as a crucial winter food resource for wild birds. Herein, we report a de novo genome assembly for B. polycarpa, generated from a combination of PacBio HiFi reads and Hi-C data. The assembled genome is 585.68 Mb in size with a contig N50 of 12.62 Mb, and 99.06% (580.18 Mb) of the assembly was successfully anchored to 34 chromosomes. The genome comprises approximately 62.77% repetitive sequences and 32,554 protein-coding genes, of which 96.15% could be functionally annotated. The BUSCO analysis reveals a genome completeness of 95.42% (n = 1,540), including 1,499 (92.87%) single-copy BUSCOs and 41 (2.54%) duplicated BUSCOs. This high-quality genome of the Phyllanthaceae enriches our understanding of the genetic underpinnings of plant reproductive ecology.
{"title":"The chromosome-scale genome assembly, annotation of Bischofia polycarpa (H. Lév.) Airy Shaw, Phyllanthaceae.","authors":"Guiliang Xin, Gang Wang, Bobin Liu, Daizhen Zhang, Boping Tang, Chuanyuan Deng, Lie Wang","doi":"10.1038/s41597-026-06554-3","DOIUrl":"https://doi.org/10.1038/s41597-026-06554-3","url":null,"abstract":"<p><p>Bischofia polycarpa (2n = 68), belonging to Phyllanthaceae family, is a native deciduous tree with naturally distribution ranging from southern Qinling Mountains and Huaihe River basin to the northern regions of Fujian and Guangdong, China. It holds significant horticultural, ornamental, and medicinal value and serves as a crucial winter food resource for wild birds. Herein, we report a de novo genome assembly for B. polycarpa, utilizing a combination of PacBio HiFi Reads and Hi-C data. In total, the genome size reaches 585.68 Mb with a contig N50 of 12.62 Mb, and 99.06% (580.18 Mb) of the assembly successfully anchored on 34 chromosomes. The genome comprises approximately 62.77% repetitive sequences and 32,554 protein-coding genes, of which 96.15% could be functionally annotated. The BUSCO analysis reveals a genome completeness of 95.42% (n = 1,540), including 1,499 (92.87%) single-copy BUSCOs and 41 (2.54%) duplicated BUSCOs. This high-quality genome of the Phyllanthaceae enriches our understanding of the genetic underpinnings of plant reproductive ecology.</p>","PeriodicalId":21597,"journal":{"name":"Scientific Data","volume":" ","pages":""},"PeriodicalIF":6.9,"publicationDate":"2026-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147327016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}