Mitochondrial DNA (mtDNA) mutations are critical to disease research, evolutionary studies, and lineage tracing but are challenging to analyze due to interference from nuclear mitochondrial sequences (NUMTs). Current high-throughput sequencing techniques rely on multiple primers or probes to amplify short mtDNA fragments, followed by alignment to a reference genome. However, this approach fails to mitigate NUMT interference, leading to ambiguous results. In this study, we present a nanopore-based third-generation sequencing (TGS) method that uses a single primer pair to amplify full-length mtDNA, effectively circumventing NUMT artifacts. Sequencing was carried out on the QITAN TECH QNome-3841hex platform, generating complete mtDNA coverage for 106 samples from eight distinct family pedigrees, including complex familial structures such as half-siblings and multi-generational households. The sequencing achieved 100% genome coverage with an average mapping rate of 99.96%, supporting comprehensive genome characterization. The resulting dataset offers valuable insights into mtDNA mutation detection, mitochondrial genetics, population genetics, ancestry tracing, and forensic identification, and may advance mtDNA sequencing technologies and intergenerational studies.
Title: A full-length mtDNA dataset for studying genetic variations across generations and complex family structures. Journal: Scientific Data.
Pub Date: 2026-02-13 DOI: 10.1038/s41597-026-06824-0
Yanan Liu, Qi Yang, Yujia Xuan, Jinyuan Zhao, Anqi Chen, Suhua Zhang
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06829-9
Jie Wang, Jin Jin, Yuekun Fang, Liting Chen, Peng Fei
Research on CAR-T cell states is crucial for understanding the mechanisms of immunotherapy. Previous studies in live cells have been primarily limited by phototoxicity, resolution, and throughput, making it difficult to conduct further research and observations on cell states. To enable more detailed studies of cell states, we developed a microscopy imaging system with subcellular resolution, low phototoxicity, high imaging throughput, and automated data reconstruction. Using this system, we have generated and shared over 400 image sets that capture the cytotoxic effects of CAR-T cells on tumor cells. The data provide an isotropic spatial resolution of 320 nm, a temporal resolution of up to 2.5 seconds per volume, and long-term observations spanning up to 5 hours. This study reports an imaging system that fills an essential gap in the field, offers valuable insights into the cytotoxic processes of CAR-T cells, and significantly advances research in this area.
Title: Light sheet microscopy imaging dataset of CAR-T-cell-mediated cytotoxicity. Journal: Scientific Data.
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06819-x
Meng-Chen Lee, Zhigang Deng
Analysis and generation of conversational gestures, especially in multi-party settings, remain an open challenge in many fields, due to the lack of publicly available datasets, models, and standardized evaluation metrics. To address this gap, we introduce Multi-TPC, a multimodal dataset of three-party conversations featuring synchronized speech, motion, and gaze. Multi-TPC captures rich conversational dynamics, enabling the study of interactions between multiple participants. Our statistical analysis reveals correlations between gestures and various modalities, including audio, text, and speaker identity. Our dataset and model provide a foundation for advancing research in discourse analysis, human communication dynamics, and multimodal interaction.
Title: Multi-TPC: A Multimodal Dataset for Three-Party Conversations with Speech, Motion, and Gaze. Journal: Scientific Data.
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06774-7
Luca Biferale, Fabio Bonaccorso, Niccolò Cocciaglia, Robin A Heinonen, Lorenzo Piro
Identifying the location and characteristics of pollution sources in turbulent flows is challenging, especially for environmental monitoring and emergency response, due to sparse, stochastic, and infrequent cue detection. Even in idealized settings, accurately modeling these phenomena remains highly complex, with realistic representations typically achievable only through experimental or simulation-based data. We introduce TURB-Smoke, a cutting-edge numerical dataset designed for investigating odor and contaminant dispersion in turbulent environments with and without mean wind. Generated via direct numerical simulations of the fully resolved three-dimensional Navier-Stokes equations, TURB-Smoke tracks hundreds of millions of Lagrangian particles released from five distinct point sources in fully developed turbulence, thus providing a reliable ground-truth framework for developing and evaluating source-tracking strategies using stationary sensors or mobile agents in realistic flows. Each particle's trajectory is continuously tracked on many characteristic turbulence timescales, recording both the position and the local flow velocity. Additionally, we provide coarse-grained concentration fields in 3D and in quasi-2D slabs containing the source, ideal for quickly testing and optimizing search algorithms under varying flow conditions.
Title: TURB-Smoke. A database of Lagrangian pollutants emitted from point sources in turbulent flows with a mean wind. Journal: Scientific Data.
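As a rough illustration of what a coarse-grained concentration field is, the sketch below bins particle positions into cubic cells and divides each count by the cell volume. It is a minimal example under stated assumptions, not the TURB-Smoke pipeline: the function name `coarse_grain`, the `cell_size` parameter, and the plain-tuple input format are all hypothetical and are not taken from the dataset documentation.

```python
import math
from collections import Counter


def coarse_grain(positions, cell_size):
    """Bin 3D particle positions into cubic cells of side `cell_size`.

    Returns a dict mapping an integer cell index (i, j, k) to the particle
    number density in that cell (particles per unit volume).
    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for x, y, z in positions:
        # Integer index of the cell that contains the particle.
        idx = (
            math.floor(x / cell_size),
            math.floor(y / cell_size),
            math.floor(z / cell_size),
        )
        counts[idx] += 1
    cell_volume = cell_size ** 3
    return {idx: n / cell_volume for idx, n in counts.items()}


# Three made-up particle positions; two of them share a cell.
particles = [(0.1, 0.1, 0.1), (0.4, 0.2, 0.3), (1.2, 0.1, 0.1)]
field = coarse_grain(particles, cell_size=0.5)
```

A quasi-2D slab could be approximated the same way by keeping only the cells whose third index matches the layer containing the source.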
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06806-2
Anthony J Anderson, David Eguren, Michael A Gonzalez, Michael Caiola, Naima Khan, Sophia Watkinson, Isabella Zuccaroli, Siegfried S Hirczy, Cyrus P Zabetian, Kelly Mills, Emile Moukheiber, Laureano Moro-Velazquez, Najim Dehak, Chelsie Motley, Brittney C Muir, Ankur Butala, Kimberly Kontson
Wearable movement sensors are powerful tools for objectively characterizing and quantifying movement. They enhance the precise characterization of gait, balance, and motor symptoms in Parkinson's disease (PD) and related disorders, facilitating in-clinic and remote assessments, disease management, and therapeutic intervention development. Access to high-quality data from these sensors can accelerate discoveries in this clinical population. The WearGait-PD open-access dataset contains raw inertial measurement unit (IMU) and sensorized insole data from 100 individuals with PD and 85 age-matched controls, synchronized to a gait walkway reference system. IMU data include 3-degree-of-freedom (DOF) acceleration, rotational velocity, magnetic field strength, and orientation for each of 13 sensors on the participant's body. Sensorized insole data include absolute pressure from 16 sensors in each insole and 3-DOF acceleration and rotational velocity. Walkway data include 2D position and relative pressure for each active sensor during every footfall. Frame-by-frame annotation of participant actions during gait and balance tasks was incorporated using synchronized video cameras. All data were associated with demographic information and clinical evaluations (e.g., medications, DBS-status, MDS-UPDRS scores).
Title: WearGait-PD: An Open-Access Wearables Dataset for Gait in Parkinson's Disease and Age-Matched Controls. Journal: Scientific Data.
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06837-9
Oldřich Čížek, Pavel Marhoul, Tomáš Kadlec, Oto Kaláb, Tomáš Jor, Antonín Hlaváček
Climate change is reshaping ecosystems worldwide, yet our ability to quantify its long-term impact across taxa is limited by a lack of reliable and comparable data. Here, we present a systematically collected long-term dataset spanning nearly a decade (2012-2021), documenting the diversity, abundance, and distribution of 439 moth species (Lepidoptera: Heterocera) from the Czech part of the Giant Mountains, a region entirely protected as Krkonoše National Park. Using standardised light traps, we sampled 982 localities across an area of 550 km², yielding a total of 64,776 specimens. Localities are accompanied by in-situ assessments of vegetation characteristics and management regimes, complemented by topographical derivatives and ecosystem information retrieved post-hoc from open spatial data. The dataset provides a valuable resource for investigating spatial and temporal patterns in moth diversity and abundance, as well as for evaluating the effects of different management practices, supporting both basic and applied research.
Title: Full-elevational gradient dataset on moth diversity and abundance in a temperate mountain range. Journal: Scientific Data.
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06758-7
Juan Trujillo, Rosario Ferrer-Cascales, Miguel A Teruel, Nicolás Ruiz-Robledillo, Javier Sanchis, Sandra García-Ponsoda, Alejandro Panagiotidis-Arrizabalaga, Natalia Albaladejo-Blázquez, Ángela Martínez-Nicolás, Jorge García-Carrasco, Alejandro Reina, Ana Lavalle, Alejandro Maté, Borja Costa-López
Attention Deficit Hyperactivity Disorder (ADHD) is a prevalent neurodevelopmental disorder characterized by inattention, hyperactivity, and impulsivity. Current diagnostic methods rely primarily on subjective clinical evaluations, which are prone to bias. Neurophysiological techniques such as electroencephalography (EEG), eye tracking, and electrodermal activity (EDA) offer promising objective alternatives; however, their adoption is limited by the scarcity of large, public, multimodal datasets. To address this gap, we introduce the BALLADEER ADHD Dataset, a comprehensive multimodal resource that integrates simultaneous EEG, eye-tracking, and physiological signals from children and adolescents with ADHD and neurotypical controls. Data were collected through carefully designed cognitive tasks aimed at eliciting neurophysiological responses related to attentional control, response inhibition, and cognitive flexibility, which are key domains affected in ADHD. The dataset facilitates the development of machine learning models for ADHD classification and biomarker discovery through cross-modal analyses of EEG, eye movements, and autonomic nervous system activity. By publicly releasing this dataset, we aim to enhance transparency, reproducibility, and innovation in computational neuroscience and ADHD research.
Title: A Multimodal Dataset for Neurophysiological and AI Applications. Journal: Scientific Data.
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06721-6
Sijia Feng, Aoyang Li, Rui Zhou, Klaus Butterbach-Bahl, Kaiyu Guan, Zhenong Jin, Majken C Looms, Sherrie Wang, Christian Igel, Claire Treat, Jørgen Eivind Olesen, Sheng Wang
Accurate estimation of surface soil moisture (SM) in terrestrial ecosystems is essential for understanding hydroclimate dynamics. The L-band Soil Moisture Active Passive (SMAP) mission provides 9-km global daily surface SM by using a microwave radiative transfer model (RTM)-based algorithm. However, the accuracy of SMAP SM is limited in regions with dense vegetation cover and complex surface conditions, due to empirical parameterization and oversimplified radiative transfer processes. To overcome these limitations, we developed a Process-Guided Machine Learning (PGML) framework that integrates RTM theory with deep learning to predict global daily surface 9-km SM from April 2015 to June 2025. Informed by domain knowledge, we developed the PGML model structure using RTM and hydrological theories, designed a Kling-Gupta efficiency-based cost function, pretrained the model with RTM simulations, and fine-tuned it with in-situ measurements. The independent validation shows that PGML SM agrees strongly with in-situ measurements (R = 0.868 and unbiased RMSE = 0.054 m³/m³). This study highlights the potential of PGML to enhance the accuracy of satellite SM, thereby supporting improved water resources and ecosystem management.
Title: Global daily 9 km remotely sensed soil moisture (2015-2025) with microwave radiative transfer-guided learning. Journal: Scientific Data.
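For readers unfamiliar with the two metrics named in the soil moisture abstract, the sketch below computes the standard Kling-Gupta efficiency (KGE) and the unbiased RMSE for a pair of observed and simulated series. It is an illustration under assumptions: the abstract does not say how the KGE-based cost function is weighted or implemented, so `kge_loss` here is simply `1 - KGE`, and all function names and sample values are hypothetical.

```python
import math
from statistics import fmean, pstdev


def pearson_r(obs, sim):
    """Pearson correlation coefficient between two equal-length series."""
    mo, ms = fmean(obs), fmean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)


def kge(obs, sim):
    """Kling-Gupta efficiency: 1 is a perfect match, lower is worse."""
    r = pearson_r(obs, sim)            # correlation term
    alpha = pstdev(sim) / pstdev(obs)  # variability ratio
    beta = fmean(sim) / fmean(obs)     # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def kge_loss(obs, sim):
    """Assumed cost form: minimising 1 - KGE maximises KGE."""
    return 1.0 - kge(obs, sim)


def unbiased_rmse(obs, sim):
    """RMSE after removing each series' mean, so a constant offset is ignored."""
    mo, ms = fmean(obs), fmean(sim)
    return math.sqrt(fmean([((s - ms) - (o - mo)) ** 2 for o, s in zip(obs, sim)]))


# Made-up soil moisture values in m³/m³, for illustration only.
observed = [0.10, 0.20, 0.30, 0.25]
simulated = [0.15, 0.25, 0.35, 0.30]  # observed plus a constant 0.05 offset
```

With a constant offset like the one in `simulated`, the unbiased RMSE is essentially zero while the KGE drops below 1 through its bias term, which is why the two metrics are usually reported together.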
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06794-3
Kevin Varga, Charles Jones
Live fuel moisture content (LFMC) strongly affects the behavior of wildland fire, resulting in its incorporation into wildfire spread models and danger ratings. In this study, over ten thousand LFMC observations are combined with predictor variables from Landsat imagery and the Weather Research and Forecasting model to train species-specific random forest models that predict the LFMC of four fuel types: chamise, old growth chamise, black sage, and bigpod ceanothus. These models are then used to create a historical, 32-year-long LFMC dataset for southern California chaparral. Additionally, the high spatial and temporal sampling frequency of chamise allowed quantile mapping bias correction to be applied. The final chamise output, which is the most robust, has a mean absolute error of 9.68% and an R² value of 0.76. The LFMC dataset successfully captures the variability in the annual cycle, the spatial heterogeneity, and the interspecies differences, which makes it applicable for better understanding varying fire season characteristics and landscape-level flammability.
Title: A 32-year species-specific live fuel moisture content dataset for southern California chaparral. Journal: Scientific Data.
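The quantile mapping bias correction mentioned for chamise can be sketched as follows. This is a generic empirical quantile mapping, assumed here for illustration; the abstract does not state which variant the authors used, and the function names and sample values are hypothetical. The idea is to find where a model value sits in the model's historical distribution and return the observed value at the same quantile.

```python
from bisect import bisect_right


def _cdf(sorted_vals, x):
    """Interpolated empirical CDF, with the i-th order statistic at i / (n - 1)."""
    n = len(sorted_vals)
    if x <= sorted_vals[0]:
        return 0.0
    if x >= sorted_vals[-1]:
        return 1.0
    i = bisect_right(sorted_vals, x) - 1  # sorted_vals[i] <= x < sorted_vals[i + 1]
    frac = (x - sorted_vals[i]) / (sorted_vals[i + 1] - sorted_vals[i])
    return (i + frac) / (n - 1)


def _quantile(sorted_vals, p):
    """Inverse of _cdf: linear interpolation between order statistics."""
    n = len(sorted_vals)
    pos = p * (n - 1)
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def quantile_map(value, model_hist, obs_hist):
    """Map a model value onto the observed distribution at the same quantile."""
    p = _cdf(sorted(model_hist), value)
    return _quantile(sorted(obs_hist), p)


# Made-up LFMC percentages: the model runs 10 points too wet across the range.
model_history = [60, 70, 80, 90, 100]
observed_history = [50, 60, 70, 80, 90]
corrected = quantile_map(85, model_history, observed_history)
```

Values outside the historical model range are clamped to the lowest or highest observed value in this sketch; a production method would need an explicit extrapolation rule.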
The pronounced heterogeneity of the tumor microenvironment (TME) in colorectal cancer (CRC) presents major obstacles in accurately predicting patient outcomes and tailoring treatment responses. Deciphering this intricate microenvironment based on histological images and classifying it into well-defined tissue components is critical for optimizing clinical interventions. Although deep learning (DL) has advanced substantially in medical imaging analysis, its application in CRC remains limited due to a shortage of comprehensively annotated datasets and large-scale, high-quality histological images. To address this gap, we present HMU-CRC-Hist550K, a curated dataset comprising 550,000 annotated image tiles derived from 500 whole-slide images, fully labeled into eight distinct TME tissue classes. The dataset represents a broad collection of publicly available CRC histology samples. Additionally, we demonstrate the utility of this resource by benchmarking three DL models on tissue segmentation tasks. HMU-CRC-Hist550K offers a valuable foundation for TME profiling, AI-assisted diagnosis, molecular subtype inference, and individualized therapy planning, while also enabling new research directions in modeling the spatial-temporal evolution of the TME.
Title: Large-Scale Histological Image Dataset with Metadata for Colorectal Cancer Microenvironment. Journal: Scientific Data.
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06675-9
Hao Wang, Huiying Li, Jingmin Xue, Yang Jiang, Keru Ma, Fenqi Du, Genshen Mo, Hao Li, Yuze Huang, Haonan Xie, Hongxue Meng, Peng Han, Shenghan Lou