Pub Date: 2026-02-23 | DOI: 10.1038/s41597-026-06872-6
Francisco Morillas-Espejo, Ester Martinez-Martin
Sign Language Recognition (SLR) is a critical component of human-machine interaction, enabling more inclusive technologies for the deaf and hard-of-hearing community. However, current datasets often suffer from data sparsity and a bias toward right-handed signs. To address these limitations, we present Sign4all, a dataset for Spanish Sign Language (LSE) specifically designed for Isolated Sign Language Recognition (ISLR). The dataset comprises 7,756 high-resolution RGB video recordings and their corresponding skeletal keypoints, covering 24 signs related to daily activities, specifically a vocabulary centered on the catering field. Unlike sparse lexicons, Sign4all adopts a high-density approach, providing an average of 323 samples per sign to facilitate data-intensive deep learning models. Moreover, the dataset is handedness-balanced, with equal representation of left- and right-handed productions of every sign to support handedness invariance. Each sample was manually segmented, temporally normalized, and spatially normalized to guarantee consistency and compatibility with different deep learning pipelines. Technical validation using Transformer and skeletal models demonstrates the dataset's integrity and the need for pre-computed augmentation splits. All data are formatted in widely supported file types (AVI for video, HDF5 for keypoints), enabling direct use in machine learning frameworks such as TensorFlow or PyTorch.
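The abstract describes temporal and spatial normalization of keypoint samples but does not publish the preprocessing code; a minimal sketch of what those two steps might look like (the target frame count and the flat x/y keypoint layout are illustrative assumptions, not the authors' specification):

```python
import math

def temporal_normalize(frames, target_len=64):
    """Resample a variable-length sequence of keypoint frames to a fixed
    length by linear interpolation over the time axis."""
    n = len(frames)
    if n == 1:
        return [list(frames[0])] * target_len
    out = []
    for t in range(target_len):
        pos = t * (n - 1) / (target_len - 1)
        i = min(int(pos), n - 2)
        a = pos - i
        # Each frame is a flat list [x0, y0, x1, y1, ...] of coordinates.
        out.append([(1 - a) * x0 + a * x1
                    for x0, x1 in zip(frames[i], frames[i + 1])])
    return out

def spatial_normalize(frame):
    """Center keypoints on their centroid and scale to unit RMS distance,
    so samples are comparable across signers and camera setups."""
    xs, ys = frame[0::2], frame[1::2]
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    scale = math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2
                          for x, y in zip(xs, ys)) / len(xs)) or 1.0
    out = []
    for x, y in zip(xs, ys):
        out.extend([(x - cx) / scale, (y - cy) / scale])
    return out
```

In practice one would apply both functions to the keypoint arrays read from the HDF5 files before batching them for a Transformer or skeletal model.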
Title: "Sign4all: a Spanish Sign Language dataset" (Scientific Data)
Pub Date: 2026-02-23 | DOI: 10.1038/s41597-026-06848-6
Bingjie Li, Binglong Xia, Ze Cheng, Yitong Xu, Zhao Duan
Although corporate sustainability reports increasingly employ visual rhetoric to influence stakeholder perceptions, quantitative tools for objectively measuring these strategies remain limited. Here we present the Non-Financial Information Disclosure Visual Representations Index (NFIVI) dataset, a dynamic resource covering Chinese listed companies. While the current release (2006-2024) encompasses a comprehensive collection of these reports, the dataset is updated annually, with data volume steadily increasing as new reports are processed. Utilizing a pipeline integrating layout analysis and computer vision, we decompose reports into three fundamental elements: text, image, and color. This dataset introduces two indices to objectively quantify visual composition and structure: the Feature-Correlation Index (NFIVI_FC), measuring stylistic consistency through multidimensional feature coherence, and the Information Entropy Index (NFIVI_EI), assessing visual complexity based on color diversity. Alongside 18 granular indicators spanning the text, image, and color dimensions at both page and document levels, these indices operationalize abstract design concepts into computable metrics. This resource enables large-scale quantitative research into corporate impression management and supports the development of automated auditing tools for non-financial disclosures.
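The abstract does not give the exact formula behind the Information Entropy Index; as an illustration only, color diversity of this kind is commonly quantified as Shannon entropy over a quantized color histogram (the 8-level channel binning below is an assumption, not the authors' definition of NFIVI_EI):

```python
import math
from collections import Counter

def color_entropy(pixels, bins=8):
    """Shannon entropy (in bits) of a quantized color distribution.
    `pixels` is an iterable of (r, g, b) tuples with 0-255 channels;
    higher entropy = more diverse color use on the page."""
    step = 256 // bins
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A page rendered in a single flat color scores 0 bits, while a page whose pixels spread evenly over many color bins scores close to the maximum of log2(number of occupied bins).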
Title: "A multi-level visual representation dataset for large-scale non-financial information disclosure" (Scientific Data)
Intrapartum biometry is of vital significance in monitoring labor progress. However, the realization of AI-based end-to-end intrapartum biometry and labor progress assessment requires intrapartum ultrasound video datasets with multi-category annotations, and currently, there is no public video dataset available for multi-category fine-grained classification. While several image datasets exist for related tasks (e.g., JNU-IFM, PSFHS, IUGC), a dedicated benchmark in the video domain remains unavailable. To bridge this gap, we have publicly released, for the first time, a multi-center, multi-device, and multi-category labeled intrapartum ultrasound dataset. This dataset comprises 774 videos / 68,106 images, along with corresponding standard plane classification labels, multi-class segmentation labels of pubic symphysis and fetal head, and two ultrasound parameter labels that characterize labor progress. This dataset can facilitate research on multi-task learning methods and the development of end-to-end automated approaches, especially in the automation of obstetric processes and auxiliary decision-making.
Title: "Maternal-Fetal Ultrasound Video Dataset for End-to-end Intrapartum Biometry and Multi-task Learning" (Scientific Data)
Authors: Ming Niu, Jieyun Bai, Yunbo Gao, Yitong Tang, Yaosheng Lu, Zhenyan Han, Hongying Hou, Yuxin Huang
Pub Date: 2026-02-23 | DOI: 10.1038/s41597-026-06900-5
Pub Date: 2026-02-23 | DOI: 10.1038/s41597-026-06889-x
Xiao Zhang, Hao Zhang, Chen Chen, Yuemei Zhao
Gynostemma guangxiense X. X. Chen & D. H. Qin, belonging to the family Cucurbitaceae, is a perennial creeping herbaceous plant endemic to China with potential medicinal and health value. Here, we report the high-quality chromosome-level genome of G. guangxiense, obtained by integrating Illumina short-read, PacBio high-fidelity (HiFi) long-read, Hi-C, and RNA-Seq technologies. The genome is anchored to 11 pseudochromosomes with a total size of 565.18 Mb and a scaffold N50 of 52.63 Mb, achieving a BUSCO completeness of 98.00%. Furthermore, we identified 27,527 protein-coding genes, of which 97.75% were functionally annotated. This genome provides an important molecular foundation for studying adaptive evolution, genetic conservation, and the effective development of valuable medicinal plant resources within the genus Gynostemma.
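The scaffold N50 reported above (52.63 Mb) is the standard assembly-contiguity statistic: the length of the shortest scaffold such that scaffolds at least that long together cover half of the assembly. A minimal reference implementation of the general definition (the example lengths are synthetic):

```python
def n50(lengths):
    """Smallest scaffold length L such that scaffolds of length >= L
    together span at least half of the total assembly size."""
    total = sum(lengths)
    acc = 0
    for length in sorted(lengths, reverse=True):
        acc += length
        if acc * 2 >= total:
            return length
    return 0
```

For example, scaffolds of 80, 70, 50, 40, 30 and 20 units total 290; walking down from the longest, the cumulative sum first reaches half the total at the 70-unit scaffold, so N50 is 70.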
Title: "A high-quality Chromosome-level genome assembly of Gynostemma guangxiense (Cucurbitaceae)" (Scientific Data)
Pub Date: 2026-02-21 | DOI: 10.1038/s41597-026-06882-4
Bofeng Han, Yan Chen, Weijie Ouyang, Danyi Chen, Jiawei Li, Weien Liang, Xudong Zhang, Chengxi Wei, Ling Liu, Sen Yan, Zhuchi Tu
Utilizing non-human primates to study the role of human Tau and its related pathologies is logical and important due to their closer similarity to human brain structure and function. In our earlier research, we generated a transgenic cynomolgus monkey model expressing Tau (P301L) through lentiviral infection of monkey embryos. These monkeys exhibited age-dependent neurodegeneration and motor dysfunction. Single-nucleus RNA sequencing (snRNA-seq) is a powerful and promising technique for elucidating cellular complexity and pathology across different tissues. However, single-cell data from non-human primate models of Tau pathology are currently nonexistent. In this study, we performed snRNA-seq on the hippocampus, striatum, and spinal cord of Tau (P301L) monkeys, providing the first snRNA-seq atlas of multiple tissue regions in a non-human primate model that simulates human tauopathies. This will offer crucial data references for cross-species single-cell level studies of tau and its related pathologies.
Title: "Single-nucleus RNA sequencing dataset of diverse tissues from wild-type monkey and Tau-P301L transgenic monkey" (Scientific Data)
Pub Date: 2026-02-21 | DOI: 10.1038/s41597-026-06851-x
Mark Dourado, Henrik Gert Hassager, Jesper Udesen, Stefania Serafin
The GaMMA (Gaze, Motion, and Multi-talker Audio) corpus captures the behavior of polyadic conversations among native Danish speakers under both normal and cocktail party conditions. Eleven groups of four normal-hearing participants were recorded while engaged in natural and spontaneous interactions. All conversations were conducted without conversational tasks. Each group was intentionally composed of participants with prior intragroup and interpersonal relations. Gaze and motion data were collected using an optical tracking system and eye-tracking glasses, while speech was recorded via omnidirectional head-worn microphones and binaural hearing aid microphones with low occlusion. Calibrations were conducted before trials, and compensation filters were created to account for differences in microphone placement. Processed versions of the audio signals, with background noise attenuated and crosstalk removed, were used to compute speech activity for all participants.
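The corpus description does not specify the speech-activity algorithm here; a minimal frame-energy baseline conveys the idea (the frame length and threshold are illustrative assumptions, not the authors' method, which operates on the denoised, crosstalk-removed signals):

```python
def speech_activity(samples, frame_len=160, threshold=0.01):
    """Return one 0/1 flag per non-overlapping frame: 1 where the mean
    squared amplitude exceeds the threshold (a crude voice-activity proxy).
    `samples` is a list of floats in [-1, 1]; 160 samples = 10 ms at 16 kHz."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        flags.append(1 if energy > threshold else 0)
    return flags
```

Real pipelines would add smoothing (hangover frames) so brief pauses within an utterance are not flagged as silence.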
Title: "The GaMMA corpus of Danish polyadic conversations with gaze, speech and motion data in quiet and noise" (Scientific Data)
Pub Date: 2026-02-21 | DOI: 10.1038/s41597-026-06879-z
Evgeny D Petrovskiy
Dense longitudinal neuroimaging usually requires substantial institutional resources, yet can also be achieved by an individual using standard clinical MRI infrastructure. This work presents a multimodal single-subject dataset comprising 85 hours of resting-state fMRI acquired over 11 months, including 51.6 hours under a standardized protocol (paired eyes-open/-closed runs, 128 sessions over 7.5 months). Additional data include 195 T1-weighted structural scans, 54 diffusion MRI sessions, physiological recordings, pre-session behavioral assessments, and detailed medication and lifestyle logs. Scans were collected primarily via self-administered acquisition on a clinical 3 T system, with sub-3 mm between-session positioning reproducibility observed in later sessions. Quality control identified 58 hours of low-motion data (mean framewise displacement <0.2 mm), with higher-motion runs occurring predominantly during sleep. The acquisition period included antidepressant dose changes and seasonal variation, forming a single-subject naturalistic context with collinear factors that preclude causal inference. The dataset follows the BIDS standard and is intended for methodological development, reliability analyses, preprocessing benchmarking, and educational use.
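The framewise-displacement criterion above (<0.2 mm) is conventionally computed following Power et al.: the sum of absolute backward differences of the six rigid-body realignment parameters, with rotational differences converted to arc length on a 50 mm sphere. A sketch under those conventions (the radius and parameter ordering are the common convention, not details taken from this paper):

```python
def framewise_displacement(motion, radius=50.0):
    """motion: list of (tx, ty, tz, rx, ry, rz) per volume, translations
    in mm and rotations in radians. Returns one FD value (mm) per volume
    after the first; mean(FD) is the usual per-run motion summary."""
    fd = []
    for prev, cur in zip(motion, motion[1:]):
        d = [abs(c - p) for c, p in zip(cur, prev)]
        # Rotations become displacements on the surface of a 50 mm sphere.
        fd.append(d[0] + d[1] + d[2] + radius * (d[3] + d[4] + d[5]))
    return fd
```

A volume that translates 0.1 mm and rotates 0.002 rad about one axis thus scores FD = 0.1 + 50 x 0.002 = 0.2 mm, right at the low-motion cutoff used here.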
Title: "A dense longitudinal multimodal single-subject rs-fMRI dataset acquired by self-administered scanning" (Scientific Data)
Pub Date: 2026-02-21 | DOI: 10.1038/s41597-026-06888-y
Belén Méndez-Cea, Isabel García-García, Manuel Pavesio-Toledano, Jose Luis Horreo, José Ignacio Seco, Francisco Javier Gallego, Juan Carlos Linares
The Moroccan fir (Abies marocana Trab.) is an endangered conifer endemic to the western Rif Mountains. Despite its ecological and economic significance, no transcriptomic data was previously available for the species. Here, we present the first de novo transcriptome assembly for A. marocana, generated from RNA-seq data obtained from three organs (leaf, stem, and root) subjected to different environmental conditions (drought, heat, cold, hormones, and physical damage), using both short- and long-read sequencing technologies, to achieve a comprehensive representation of the species' transcriptome. The assembly achieved a completeness value of 92.1% according to BUSCO, with 279,439 final transcripts, of which approximately 45.2% were functionally annotated. This high-quality transcriptome provides a valuable resource for advancing genetic research and supporting conservation efforts for this vulnerable species.
Title: "De novo transcriptome assembly of the Moroccan fir, Abies marocana Trab." (Scientific Data)
Pub Date: 2026-02-21 | DOI: 10.1038/s41597-026-06864-6
Grant Smith, Alberto Meucci, Claire Spillman, Ron Hoeke, Vanessa Hernaman, Claire Trenham, Stefan Zieger, Bryan Hally, Emilio Echevarria
A multi-decadal global wind-wave hindcast dataset, WHACS (the Wave Hindcast for ACS), spanning 1979 to near present, was developed to offer insight into historical wave conditions, both directly and as boundary forcing for localised simulations. Applications for WHACS include coastal management, climate research, and renewable energy projects, ultimately helping communities and industries make informed decisions to improve safety, efficiency, and resilience with respect to wave conditions. The dataset features a near-global spherical multi-cell (SMC) grid that aligns with the Bureau's operational wave forecast model and has been calibrated to better represent extreme wave conditions by improving the representation of extreme winds. The available output consists of multiple hourly bulk and spectral-partition wave parameters on the native SMC grid, as well as regularly regridded global and regional bulk wave parameters. For the Indo-Pacific, gridded full spectral data are available across exclusive economic zones.
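Bulk wave parameters of the kind WHACS provides are derived from the variance density spectrum; the standard relation for significant wave height is Hs = 4 sqrt(m0), where m0 is the zeroth spectral moment (the area under the spectrum). A sketch using trapezoidal integration over a 1-D frequency spectrum (the example spectrum is synthetic; real output also integrates over direction):

```python
import math

def significant_wave_height(freqs, spectrum):
    """Hs = 4 * sqrt(m0), with m0 the zeroth spectral moment, i.e. the
    area under the variance density spectrum E(f) (m^2/Hz) over frequency,
    approximated here by the trapezoidal rule."""
    m0 = sum((f1 - f0) * (e0 + e1) / 2
             for f0, f1, e0, e1 in zip(freqs, freqs[1:],
                                       spectrum, spectrum[1:]))
    return 4 * math.sqrt(m0)
```

A flat spectrum of 1 m^2/Hz over a 1 Hz band has m0 = 1 m^2 and therefore Hs = 4 m, which is the kind of consistency check one can run between the spectral and bulk-parameter outputs of a hindcast.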
Title: "WHACS: An Improved Global Wave Hindcast for the Australian Climate Service" (Scientific Data)
Pub Date: 2026-02-21 | DOI: 10.1038/s41597-026-06878-0
Yikai Wu, Xuejiao Liu, Karin Hrovatin, Dezhi Wu, Stephanie Linker, Mathias Winkel, Feng Tan
The design and optimization of antibodies and nanobodies using deep generative models hold transformative potential for therapeutic and diagnostic applications, but progress is hindered by the fragmented and inconsistent nature of existing datasets. To address these limitations, we introduce the Antibody and Nanobody Design Dataset (ANDD), a unified dataset that integrates sequence, structure, antigen, and affinity data from 15 diverse sources. ANDD is a comprehensive resource comprising 48,683 antibody/nanobody sequences, with structural data for 24,941 entries and antigen sequences for 12,575 entries. We further augmented the affinity data with 2,271 affinity values predicted using ANTIPASTI, a robust model for binding affinity prediction. Consequently, ANDD includes 9,557 affinity values, making it the largest dataset to date of antibody/nanobody-antigen pairs with affinity data. By addressing the challenges of data fragmentation and inconsistency, ANDD provides a robust foundation for training deep generative models, which can better model antibody/nanobody-antigen interactions and design novel antibodies and nanobodies with improved specificity and efficacy, paving the way for the development of targeted therapeutics.
Title: "A Unified Dataset for Antibody and Nanobody Design Including Sequence, Structure, and Binding Affinity Data" (Scientific Data, open access)
Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12932709/pdf/