Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06806-2
Anthony J Anderson, David Eguren, Michael A Gonzalez, Michael Caiola, Naima Khan, Sophia Watkinson, Isabella Zuccaroli, Siegfried S Hirczy, Cyrus P Zabetian, Kelly Mills, Emile Moukheiber, Laureano Moro-Velazquez, Najim Dehak, Chelsie Motley, Brittney C Muir, Ankur Butala, Kimberly Kontson
Wearable movement sensors are powerful tools for objectively characterizing and quantifying movement. They enhance the precise characterization of gait, balance, and motor symptoms in Parkinson's disease and related disorders, facilitating in-clinic and remote assessments, disease management, and therapeutic intervention development. Access to high-quality data from these sensors can accelerate discoveries in this clinical population. The WearGait-PD open-access dataset contains raw inertial measurement unit (IMU) and sensorized insole data from 100 individuals with PD and 85 age-matched controls, synchronized to a gait walkway reference system. IMU data include 3-degree of freedom (DOF) acceleration, rotational velocity, magnetic field strength, and orientation for each of 13 sensors on the participant's body. Sensor insole data include absolute pressure from 16 sensors in each insole and 3-DOF acceleration and rotational velocity. Walkway data include 2D position and relative pressure for each active sensor during every footfall. Frame-by-frame annotation of participant actions during gait and balance tasks was incorporated using synchronized video cameras. All data were associated with demographic information and clinical evaluations (e.g., medications, DBS-status, MDS-UPDRS scores).
Title: WearGait-PD: An Open-Access Wearables Dataset for Gait in Parkinson's Disease and Age-Matched Controls. Journal: Scientific Data. URL: https://doi.org/10.1038/s41597-026-06806-2
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06837-9
Oldřich Čížek, Pavel Marhoul, Tomáš Kadlec, Oto Kaláb, Tomáš Jor, Antonín Hlaváček
Climate change is reshaping ecosystems worldwide, yet our ability to quantify its long-term impact across taxa is limited by a lack of reliable and comparable data. Here, we present a systematically collected long-term dataset spanning nearly a decade (2012-2021), documenting the diversity, abundance, and distribution of 439 moth species (Lepidoptera: Heterocera) from the Czech part of the Giant Mountains, a region entirely protected as Krkonoše National Park. Using standardised light traps, we sampled 982 localities across an area of 550 km², yielding a total of 64,776 specimens. Localities are accompanied by in-situ assessments of vegetation characteristics and management regimes, complemented by topographical derivatives and ecosystem information retrieved post-hoc from open spatial data. The dataset provides a valuable resource for investigating spatial and temporal patterns in moth diversity and abundance, as well as for evaluating the effects of different management practices, supporting both basic and applied research.
Title: Full-elevational gradient dataset on moth diversity and abundance in a temperate mountain range. Journal: Scientific Data. URL: https://doi.org/10.1038/s41597-026-06837-9
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06758-7
Juan Trujillo, Rosario Ferrer-Cascales, Miguel A Teruel, Nicolás Ruiz-Robledillo, Javier Sanchis, Sandra García-Ponsoda, Alejandro Panagiotidis-Arrizabalaga, Natalia Albaladejo-Blázquez, Ángela Martínez-Nicolás, Jorge García-Carrasco, Alejandro Reina, Ana Lavalle, Alejandro Maté, Borja Costa-López
Attention Deficit Hyperactivity Disorder (ADHD) is a prevalent neurodevelopmental disorder characterized by inattention, hyperactivity, and impulsivity. Current diagnostic methods rely primarily on subjective clinical evaluations, which are prone to bias. Neurophysiological techniques such as electroencephalography (EEG), eye tracking, and electrodermal activity (EDA) offer promising objective alternatives; however, their adoption is limited by the scarcity of large, public, multimodal datasets. To address this gap, we introduce the BALLADEER ADHD Dataset, a comprehensive multimodal resource that integrates simultaneous EEG, eye-tracking, and physiological signals from children and adolescents with ADHD and neurotypical controls. Data were collected through carefully designed cognitive tasks aimed at eliciting neurophysiological responses related to attentional control, response inhibition, and cognitive flexibility, which are key domains affected in ADHD. The dataset facilitates the development of machine learning models for ADHD classification and biomarker discovery through cross-modal analyses of EEG, eye movements, and autonomic nervous system activity. By publicly releasing this dataset, we aim to enhance transparency, reproducibility, and innovation in computational neuroscience and ADHD research.
Title: A Multimodal Dataset for Neurophysiological and AI Applications. Journal: Scientific Data. URL: https://doi.org/10.1038/s41597-026-06758-7
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06721-6
Sijia Feng, Aoyang Li, Rui Zhou, Klaus Butterbach-Bahl, Kaiyu Guan, Zhenong Jin, Majken C Looms, Sherrie Wang, Christian Igel, Claire Treat, Jørgen Eivind Olesen, Sheng Wang
Accurate estimation of surface soil moisture (SM) in terrestrial ecosystems is essential for understanding hydroclimate dynamics. The L-band Soil Moisture Active Passive (SMAP) mission provides 9-km global daily surface SM by using a microwave radiative transfer model (RTM)-based algorithm. However, the accuracy of SMAP SM is limited in regions with dense vegetation cover and complex surface conditions, due to the empirical parameterization and oversimplified radiative transfer processes. To overcome these limitations, we developed a Process-Guided Machine Learning (PGML) framework to integrate RTM theories and deep learning to predict global daily surface 9-km SM from April 2015 to June 2025. Informed by domain knowledge, we developed the PGML model structure using RTM and hydrological theories, designed a Kling-Gupta efficiency-based cost function, pretrained it with RTM simulations, and fine-tuned it with in-situ measurements. The independent validation shows that PGML SM has strong agreement with in-situ measurements (R = 0.868 and unbiased RMSE = 0.054 m³/m³). This study highlights the potential of PGML to enhance the accuracy of satellite SM, thereby supporting improved water resources and ecosystem management.
Title: Global daily 9 km remotely sensed soil moisture (2015-2025) with microwave radiative transfer-guided learning. Journal: Scientific Data. URL: https://doi.org/10.1038/s41597-026-06721-6
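The soil moisture abstract above names two quantities that are easy to state concretely: the Kling-Gupta efficiency (KGE) and the unbiased RMSE. The sketch below is illustrative only. It uses the standard KGE decomposition into correlation, variability ratio, and bias ratio, and it assumes a loss of the form 1 − KGE; the abstract does not give the exact cost function used in PGML, so `kge_loss` and all function names here are assumptions.

```python
import math


def _mean(xs):
    return sum(xs) / len(xs)


def _std(xs):
    # Population standard deviation.
    m = _mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def pearson_r(sim, obs):
    ms, mo = _mean(sim), _mean(obs)
    cov = sum((s - ms) * (o - mo) for s, o in zip(sim, obs)) / len(sim)
    return cov / (_std(sim) * _std(obs))


def kge(sim, obs):
    """Kling-Gupta efficiency in its standard form; 1.0 is a perfect match."""
    r = pearson_r(sim, obs)
    alpha = _std(sim) / _std(obs)    # variability ratio
    beta = _mean(sim) / _mean(obs)   # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def kge_loss(sim, obs):
    # Assumed loss form (not confirmed by the abstract): minimise 1 - KGE.
    return 1.0 - kge(sim, obs)


def ubrmse(sim, obs):
    """Unbiased RMSE: RMSE after removing each series' own mean."""
    ms, mo = _mean(sim), _mean(obs)
    sq = sum(((s - ms) - (o - mo)) ** 2 for s, o in zip(sim, obs))
    return math.sqrt(sq / len(sim))


if __name__ == "__main__":
    obs = [0.10, 0.20, 0.30, 0.25]       # made-up in-situ values, m³/m³
    sim = [o + 0.05 for o in obs]        # same dynamics, constant wet bias
    print("KGE:", kge(sim, obs))
    print("ubRMSE:", ubrmse(sim, obs))
```

A constant offset leaves the unbiased RMSE at essentially zero while lowering KGE through the bias term, which is why the two metrics are usually reported together.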
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06794-3
Kevin Varga, Charles Jones
Live fuel moisture content (LFMC) strongly affects the behavior of wildland fire, resulting in its incorporation into wildfire spread models and danger ratings. In this study, over ten thousand LFMC observations are combined with predictor variables from Landsat imagery and the Weather Research and Forecasting model to train species-specific random forest models that predict the LFMC of four fuel types: chamise, old growth chamise, black sage, and bigpod ceanothus. These models are then utilized to create a historical, 32-year-long LFMC dataset in southern California chaparral. Additionally, the high spatial and temporal sampling frequency of chamise allowed for quantile mapping bias correction to be applied. The final chamise output, which is the most robust, has a mean absolute error of 9.68% and an R² value of 0.76. The LFMC dataset successfully captures the variability in the annual cycle, the spatial heterogeneity, and the interspecies differences, which makes it applicable for better understanding varying fire season characteristics and landscape-level flammability.
Title: A 32-year species-specific live fuel moisture content dataset for southern California chaparral. Journal: Scientific Data. URL: https://doi.org/10.1038/s41597-026-06794-3
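The LFMC abstract above mentions quantile mapping bias correction without describing the variant used. As a rough illustration of the general technique, the sketch below implements plain empirical quantile mapping: each model value is located in the model's calibration distribution, and the observed value at the same quantile is returned. The function names and sample numbers are hypothetical, and this is not claimed to be the authors' procedure.

```python
from bisect import bisect_left


def _quantile(sorted_vals, p):
    """Linearly interpolated quantile of a sorted list, p in [0, 1]."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def _cdf(sorted_vals, x):
    """Interpolated empirical probability of x, clipped to [0, 1]."""
    if x <= sorted_vals[0]:
        return 0.0
    if x >= sorted_vals[-1]:
        return 1.0
    i = bisect_left(sorted_vals, x)   # sorted_vals[i-1] < x <= sorted_vals[i]
    lo, hi = sorted_vals[i - 1], sorted_vals[i]
    frac = (x - lo) / (hi - lo)
    return (i - 1 + frac) / (len(sorted_vals) - 1)


def quantile_map(model_calib, obs_calib, model_values):
    """Map model values onto the observed distribution, quantile by quantile."""
    m_sorted = sorted(model_calib)
    o_sorted = sorted(obs_calib)
    return [_quantile(o_sorted, _cdf(m_sorted, x)) for x in model_values]


if __name__ == "__main__":
    # Made-up LFMC percentages for a calibration period.
    model_calib = [60, 70, 80, 90, 100]
    obs_calib = [70, 85, 100, 115, 130]
    print(quantile_map(model_calib, obs_calib, [60, 75, 80, 100]))
```

Values outside the calibration range are clipped to the lowest or highest observed quantile here; real applications often extrapolate instead, which is a separate design decision.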
The pronounced heterogeneity of the tumor microenvironment (TME) in colorectal cancer (CRC) presents major obstacles in accurately predicting patient outcomes and tailoring treatment responses. Deciphering this intricate microenvironment based on histological images and classifying it into well-defined tissue components is critical for optimizing clinical interventions. Although deep learning (DL) has advanced substantially in medical imaging analysis, its application in CRC remains limited due to a shortage of comprehensively annotated datasets and large-scale, high-quality histological images. To address this gap, we present HMU-CRC-Hist550K, a curated dataset comprising 550,000 annotated image tiles derived from 500 whole-slide images, fully labeled into eight distinct TME tissue classes. The dataset represents a broad collection of publicly available CRC histology samples. Additionally, we demonstrate the utility of this resource by benchmarking three DL models on tissue segmentation tasks. HMU-CRC-Hist550K offers a valuable foundation for TME profiling, AI-assisted diagnosis, molecular subtype inference, and individualized therapy planning, while also enabling new research directions in modeling the spatial-temporal evolution of the TME.
Title: Large-Scale Histological Image Dataset with Metadata for Colorectal Cancer Microenvironment. Journal: Scientific Data. URL: https://doi.org/10.1038/s41597-026-06675-9
Authors: Hao Wang, Huiying Li, Jingmin Xue, Yang Jiang, Keru Ma, Fenqi Du, Genshen Mo, Hao Li, Yuze Huang, Haonan Xie, Hongxue Meng, Peng Han, Shenghan Lou
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06675-9
Olfaction is the primary sensory modality governing maternal behavior in rodents. To meet the demands of maternal care, the brain undergoes extensive and temporally dynamic plasticity during the perinatal period, particularly within the olfactory bulb (OB). However, longitudinal data describing the molecular landscape of the OB across the entire reproductive cycle are currently unavailable. We generated a high-resolution transcriptomic dataset of the mouse OB to map molecular reprogramming events during reproduction. Samples were collected at five strategic time points: non-pregnancy, gestation day 10, parturition, postpartum day 7, and weaning. Using bulk RNA-seq, we constructed a dynamic transcriptomic atlas of the maternal OB. This dataset captures stage-specific gene expression changes associated with neurogenesis, synaptic plasticity, and neuromodulation. This work provides a critical molecular resource to facilitate future research into the adaptive remodeling of the maternal neural circuit.
Title: A time-series transcriptomic dataset of the mouse olfactory bulb across pregnancy and lactation. Journal: Scientific Data. URL: https://doi.org/10.1038/s41597-026-06833-z
Authors: Xiaolei Song, Gengwei Zhang, Fengzhu Zhang, Tongye Fu, Jingzhe Yu, Danyu Han, Wenhui Li, Rongliang Guo
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06833-z
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06798-z
Fermin Travi, Bruno Bianchi, Diego Fernandez Slezak, Juan E Kamienkowski
Eye-tracking is a well-established method for studying reading processes. Our gaze jumps word to word, sampling information almost sequentially. Time spent on each word, along with skipping or revisiting patterns, provides proxies for cognitive processes during comprehension. However, few studies have focused on Spanish, where empirical data remain scarce, and little is known about how findings from other languages translate to Spanish reading behavior. We present the largest publicly available Spanish eye-tracking dataset to date, comprising readings of self-contained stories from 113 native speakers (mean age 23.8; 61 females, 52 males). The dataset comprises both long stories (3300 ± 747 words, 11 readings per item on average) and short stories (795 ± 135 words, 50 readings per item on average), providing extensive coverage of natural reading scenarios with over 940,000 fixations covering close to 40,000 words (8,500 unique words). This comprehensive resource offers opportunities to investigate Spanish eye movement patterns, explore language-specific cognitive processes, examine Spanish linguistic phenomena, and develop computational algorithms for reading research and natural language processing applications.
Title: Cuentos: A Large-Scale Eye-Tracking Reading Corpus on Spanish Narrative Texts. Journal: Scientific Data. URL: https://doi.org/10.1038/s41597-026-06798-z
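The Cuentos abstract above describes time spent on each word, skipping, and revisiting as proxies for comprehension processes. The sketch below shows how such word-level measures might be derived from a fixation sequence. The input layout, a time-ordered list of `(word_index, duration_ms)` pairs, is a hypothetical simplification and not the dataset's actual file format. "Skipped" here means never fixated at all, which is coarser than the first-pass definition common in reading research.

```python
from collections import defaultdict


def word_measures(fixations, n_words):
    """Summarise a time-ordered fixation list for a text of n_words words.

    fixations: list of (word_index, duration_ms) tuples (hypothetical layout).
    """
    total_time = defaultdict(int)   # summed fixation duration per word
    first_fixation = {}             # duration of the first fixation on each word
    regressions = 0                 # moves from a later word back to an earlier one
    prev = None
    for word_idx, duration in fixations:
        total_time[word_idx] += duration
        first_fixation.setdefault(word_idx, duration)
        if prev is not None and word_idx < prev:
            regressions += 1
        prev = word_idx
    skipped = [i for i in range(n_words) if i not in total_time]
    return {
        "total_time": dict(total_time),
        "first_fixation": first_fixation,
        "skipped": skipped,
        "regressions": regressions,
    }


if __name__ == "__main__":
    # Made-up scanpath over a five-word sentence.
    scanpath = [(0, 210), (1, 180), (3, 250), (1, 120), (4, 200)]
    print(word_measures(scanpath, n_words=5))
```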
Pub Date: 2026-02-12 DOI: 10.1038/s41597-026-06815-1
Mengqi Xu, Yuge Cui, Hongcheng Kuang, Kai Wei, Wenjuan Shan
The Yarkand hare (Lepus yarkandensis) is endemic to the Tarim Basin in Xinjiang, China. It is a key species and a critical component of the Tarim Basin ecosystems. However, the lack of a reference genome has hindered evolutionary and genetic studies of this species. Here, we assembled a telomere-to-telomere (T2T) genome of the Yarkand hare (LepYark_1.0) using PacBio HiFi, Nanopore, and Hi-C sequencing. The assembled genome size is approximately 2.70 Gb, with a scaffold N50 of 126.86 Mb. About 94.88% of the assembled sequences could be anchored to 24 pseudo-chromosomes, and a BUSCO assessment indicated a completeness of 99.0%. Repetitive sequences comprise 46.38% of the genome, with short interspersed nuclear elements (SINEs) accounting for the largest proportion. Additionally, we identified 24 centromeres and 46 telomeres. A total of 32,298 protein-coding genes were annotated using de novo prediction and transcriptome data, and 85% of them were functionally annotated. This genome assembly provides genomic resources for studies on conservation, adaptive evolution, and the genetic basis of important traits of the Yarkand hare.
Title: Telomere to telomere level genome assembly of the Yarkand hare (Lepus yarkandensis). Journal: Scientific Data. URL: https://doi.org/10.1038/s41597-026-06815-1
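The genome abstract above reports a scaffold N50 of 126.86 Mb. For readers unfamiliar with the statistic, N50 is the length L such that scaffolds of length L or longer together cover at least half of the assembly. A minimal sketch follows, with made-up scaffold lengths rather than values from LepYark_1.0:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sort lengths from longest to shortest and accumulate; the N50 is the
    length at which the running sum first reaches half the total.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


if __name__ == "__main__":
    # Hypothetical scaffold lengths in Mb; total is 290, half is 145.
    scaffolds = [80, 70, 50, 40, 30, 20]
    print(n50(scaffolds))  # 80 + 70 = 150 >= 145, so N50 is 70
```

Because a few very long scaffolds dominate the running sum, N50 rewards contiguity: merging fragments raises it even when the total assembly size is unchanged.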
Pub Date: 2026-02-11 DOI: 10.1038/s41597-026-06670-0
Daniel Bottomly, Christopher G Suciu, Benjamin Cordier, Nathaniel Evans, Alfonso Poire, Christina Zheng, Jeffrey W Tyner, Alan Hutson, Shannon K McWeeney
Biomedical machine learning (ML) models raise critical concerns about embedded assumptions influencing clinical decision-making, necessitating robust documentation frameworks for datasets that are shared via external repositories. The effectiveness of fairness-aware algorithms hinges on users' prior awareness of specific issues in the data, including data collection methodology, provenance, and quality. Current ML-focused documentation approaches impose impractical burdens on data generators and conflate data/model accountability. This is problematic for resource datasets not explicitly created for ML applications. This study addresses these gaps through a two-step process: First, we derived consensus documentation fields by mapping elements across four key templates. Second, we surveyed biomedical stakeholders across four roles (clinicians, bench scientists, data managers, and computationalists) to assess field importance and relevance. This revealed important role-dependent prioritization differences, motivating the development of the Biomedical Data Manifest, a modular template that uses persona-specific field presentation to reduce generator burden while ensuring end-users receive role-relevant information. The Biomedical Data Manifest improves transparency for datasets deposited in public or controlled-access repositories and bias mitigation across ML applications.
Title: Biomedical Data Manifest: A lightweight data documentation mapping to increase transparency for AI/ML. Journal: Scientific Data. URL: https://doi.org/10.1038/s41597-026-06670-0