Pub Date: 2026-01-23 DOI: 10.1038/s41597-026-06617-5
Feng-Qi Li, Yong-Zhi Zhong, Tim Haye, Francesco Tortorici, Sofia Victoria Prieto, Li Wang, Zi-Jian Song, Jin-Ping Zhang
Trissolcus cultratus, a parasitoid wasp of the brown marmorated stink bug (BMSB), exhibits divergent parasitic capacities between Chinese and Swiss populations: Chinese strains reproduce successfully on fresh and cold-stored host eggs under both laboratory and field conditions, whereas Swiss strains fail to develop in fresh BMSB eggs. We sequenced and assembled the first T. cultratus transcriptome. A total of 184,932,102 and 195,101,432 clean reads from the Chinese and Swiss strains, respectively, were de novo assembled into 19,280 and 16,322 unigenes. From these assemblies, 9,811 and 9,582 protein-coding genes were predicted for the two strains. Among the 19,280 and 16,322 unigenes, we further identified 554 and 557 transcription factors in the Chinese and Swiss strains, respectively. This work presents the first transcriptomic dataset for T. cultratus, offering a foundation for subsequent research on its population genetics.
Title: Transcriptomic Resource of Trissolcus cultratus: A Key Biological Control Agent for Halyomorpha halys. Journal: Scientific Data.
The superfamily Nemouroidea (Plecoptera) represents one of the most diverse and ecologically significant groups of stoneflies, with nymphs serving as crucial bioindicators of freshwater ecosystem health due to their sensitivity to water quality. However, evolutionary and genomic studies of this group have been hindered by the lack of high-quality reference genomes. Here, we present a chromosome-level genome assembly for Rhopalopsole triangulispina Mo and Li, 2025, a member of Nemouroidea, generated by integrating PacBio HiFi long reads, Illumina short reads, and Hi-C chromatin interaction data. The final assembly spans 347.119 Mb with a scaffold N50 of 27.479 Mb, and 96.91% (336.39 Mb) of the genome is anchored to 13 pseudochromosomes. BUSCO assessment indicates a completeness of 98.4% (insecta_odb10). The genome contains 48.50% repetitive elements (168.35 Mb) and encodes 12,857 protein-coding genes, which were annotated using homology, transcriptomic, and ab initio evidence. This high-quality genome provides a foundational resource for resolving phylogenetic relationships within Nemouroidea, advancing studies on insect genome evolution, and enhancing freshwater biomonitoring efforts through genomic tools.
Title: Chromosome-level genome assembly of the stonefly Rhopalopsole triangulispina Mo and Li, 2025 (Plecoptera: Leuctridae). Authors: Aili Lin, Jinjun Cao, Dávid Murányi, Ding Yang, Weihai Li, Raorao Mo. Journal: Scientific Data. Pub Date: 2026-01-23. DOI: 10.1038/s41597-026-06631-7
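The stonefly abstract reports a scaffold N50 and the share of the assembly anchored to pseudochromosomes. As a minimal sketch of how such summary statistics are conventionally computed (the function names and the toy scaffold lengths are illustrative assumptions, not the authors' pipeline), N50 is the length of the scaffold at which the running total, taken from longest to shortest, first reaches half of the assembly size:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def percent_of(part, total):
    """Express `part` as a percentage of `total`."""
    return 100.0 * part / total


if __name__ == "__main__":
    # Toy scaffold lengths in Mb (hypothetical, NOT the real assembly).
    toy_scaffolds = [40, 30, 20, 10, 5, 5]
    print("toy N50:", n50(toy_scaffolds))  # 40 + 30 = 70 >= 55, so N50 is 30

    # Figures quoted in the abstract: anchored and repeat content vs. assembly size.
    print("anchored %:", round(percent_of(336.39, 347.119), 2))
    print("repeats  %:", round(percent_of(168.35, 347.119), 2))
```

Run on the figures quoted in the abstract, the two percentages agree with the reported 96.91% and 48.50% to two decimal places.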
Pub Date: 2026-01-23 DOI: 10.1038/s41597-026-06641-5
Wei Feng, Dechao An, Cheinway Hwang, Mingzhi Sun, Xiaodong Chen, Zeyuan Zhang, Meng Yang, Min Zhong
The Surface Water and Ocean Topography (SWOT) mission provides novel perspectives for mapping the seafloor from marine gravity. In this study, a new global bathymetric product, SYSU_Topo, was developed using gravity anomalies (the SWOT_02 model released by Scripps Institution of Oceanography) and the high-precision gravity-geological method (GGM). The data (NetCDF format; global range from 80°S to 80°N, 1-arc-minute resolution; variables: lat, lon, z) and processing codes are openly available for immediate reuse in ocean modeling, geophysics, and seafloor mapping. To obtain the optimal density contrast for GGM, a sliding-window strategy of partition inversion was adopted, and a fusion method with boundary-constraint points was developed to eliminate the splicing effect of partition inversion. The model was validated with 11,167,583 single-beam bathymetric points and newly added multibeam grid points from GEBCO_2024. The SYSU_Topo model achieves superior performance in the South China Sea, with a standard deviation of 132.07 m, an 8%-26% improvement over other models. Compared to traditional altimeter-derived gravity anomalies, SWOT data exhibit greater potential for filling regions lacking high-precision bathymetry.
Title: SYSU_Topo: a 1-arc-minute global bathymetry from SWOT-derived gravity using the gravity-geological method. Journal: Scientific Data.
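The gravity-geological method named in the SYSU_Topo abstract is usually described with a Bouguer-plate relation between short-wavelength gravity and seafloor depth. The sketch below shows that textbook relation only; it is not the authors' code. The function names, the example depths, and the density contrast of 1670 kg/m^3 are illustrative assumptions (the paper estimates an optimal contrast per sliding window, which is not reproduced here).

```python
import math

G = 6.674e-11   # gravitational constant, m^3 kg^-1 s^-2
MGAL = 1e-5     # 1 mGal expressed in m/s^2


def short_wavelength_gravity(depth_m, ref_depth_m, density_contrast):
    """Bouguer-plate gravity (mGal) of seafloor at depth_m relative to ref_depth_m.

    Depths are negative below sea level; density_contrast is in kg/m^3.
    """
    return 2.0 * math.pi * G * density_contrast * (depth_m - ref_depth_m) / MGAL


def depth_from_gravity(g_short_mgal, ref_depth_m, density_contrast):
    """Invert the same relation: recover depth (m) from short-wavelength gravity."""
    return g_short_mgal * MGAL / (2.0 * math.pi * G * density_contrast) + ref_depth_m


if __name__ == "__main__":
    # Hypothetical control point: seafloor at -3000 m, reference depth -5000 m.
    contrast = 1670.0  # kg/m^3, illustrative value only
    g_short = short_wavelength_gravity(-3000.0, -5000.0, contrast)
    print("short-wavelength gravity (mGal):", round(g_short, 2))
    print("recovered depth (m):", round(depth_from_gravity(g_short, -5000.0, contrast), 3))
```

In the usual workflow, the short-wavelength term at control points is subtracted from observed gravity to obtain a long-wavelength field, which is interpolated and removed elsewhere before the inversion step is applied.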
Pub Date: 2026-01-23 DOI: 10.1038/s41597-026-06567-y
Jonas Anderegg, Bruce A McDonald
Time-resolved phenotyping of disease symptoms enables dissection of resistance mechanisms and improves diagnosis, but acquiring phenotypic data at satisfactory scale remains challenging. Advances in imaging and image processing have improved measurement precision, robustness, and throughput, but further improvements are needed for practical application. We present a data set comprising 12,520 high-resolution (~0.03 mm/pixel) RGB images representing 1,032 time series of wheat leaves with developing disease symptoms. All images are geometrically aligned with a median precision of 0.16 mm (≈5 pixels). The dataset includes transformation matrices, symptom segmentation masks, metadata on treatments, weather, crop phenology, and disease occurrence, and a lightweight Python toolkit for loading, aligning, inspecting, and editing image sequences. These resources enable detailed investigation of leaf-level disease dynamics such as lesion, pustule, and fruiting body emergence rates, lesion growth, and dynamic interactions of disease development with spatial and environmental contexts. They offer a broad basis for developing improved methods for image alignment and symptom detection, segmentation, and tracking, possibly by tackling these connected challenges within a single end-to-end framework.
Title: High-Resolution Leaf Image Sequences with Geometric Alignment for Dynamic Phenotyping of Foliar Diseases. Journal: Scientific Data.
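The leaf-image abstract quotes the alignment precision in both millimetres and pixels and mentions per-image transformation matrices. The sketch below illustrates the unit conversion and how a 3x3 homogeneous matrix maps a pixel coordinate. The matrix layout, the function names, and the example translation are assumptions for illustration; the dataset's own toolkit and file format may differ.

```python
MM_PER_PIXEL = 0.03  # approximate resolution quoted in the abstract


def mm_to_pixels(distance_mm, mm_per_pixel=MM_PER_PIXEL):
    """Convert a distance in millimetres to pixels at the given resolution."""
    return distance_mm / mm_per_pixel


def apply_transform(matrix, point):
    """Map an (x, y) point through a 3x3 homogeneous transformation matrix."""
    x, y = point
    xh = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    yh = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    w = matrix[2][0] * x + matrix[2][1] * y + matrix[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this transform")
    return (xh / w, yh / w)


if __name__ == "__main__":
    # 0.16 mm at ~0.03 mm/pixel is roughly 5 pixels, as the abstract states.
    print("median precision in pixels:", round(mm_to_pixels(0.16), 2))

    # Hypothetical transform: shift by +5 px in x and -3 px in y.
    shift = [[1, 0, 5], [0, 1, -3], [0, 0, 1]]
    print("aligned point:", apply_transform(shift, (10, 20)))
```

A homogeneous 3x3 form is used here because it covers translation, affine, and projective alignment with the same code path.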
Pub Date: 2026-01-23 DOI: 10.1038/s41597-026-06549-0
Yihe Tian, Kwan Man Cheng, Zhengbo Zhang, Tao Zhang, Junning Feng, Zhehao Ren, Suju Li, Dongmei Yan, Bing Xu
Artificial Night-Time Light (NTL) remote sensing is a vital proxy for quantifying the intensity and spatial distribution of human activities. Although the NPP-VIIRS sensor provides high-quality NTL observations, its temporal coverage, which begins in 2012, restricts long-term time-series studies that extend to earlier periods. Current extended VIIRS-like NTL data products suffer from two significant shortcomings: the underestimation of light intensity and the omission of structural details. To overcome these limitations, we present the Extended VIIRS-like Artificial Nighttime Light (EVAL) dataset, a new annual NTL dataset for China spanning from 1986 to 2024. This dataset was generated using a novel two-stage deep learning model designed to address the aforementioned shortcomings. The model first constructs an initial estimate and subsequently refines fine-grained structural details using high-resolution impervious surface data as guidance. Quantitative evaluations demonstrate that EVAL significantly outperforms state-of-the-art products, exhibiting superior temporal consistency and a stronger correlation with socioeconomic indicators.
Title: An Extended VIIRS-like Artificial Nighttime Light Data Reconstruction (1986-2024). Journal: Scientific Data.
Bloodstream infections (BSIs) cause high morbidity and mortality across all age groups and require urgent, accurate intervention. Gram stain interpretation of positive blood cultures (PBCs) is crucial for the early diagnosis of BSIs, yet this manual process is labor-intensive, time-consuming, and highly operator-dependent. Artificial intelligence (AI)-assisted microscopic interpretation of stained smears could benefit microbiology diagnostics. To support automated identification of blood-culture Gram stains, this study introduces a dataset of Gram-stained smears collected in clinical practice. The dataset includes 505 microscopic images, covering up to 57 species associated with BSIs, with a total of 7,528 annotations. These annotations are categorized by staining characteristics and morphological features into cocci, bacilli, and fungi. We trained and validated an object detection model based on the YOLOv10 architecture on this dataset to automatically localize and classify these morphological categories in microscopic images. The publicly released dataset will support the development of AI methods that automatically interpret Gram stains from PBCs for routine clinical application.
Title: An annotated dataset of Gram stains from positive blood cultures. Authors: Qiaolian Yi, Xiaoyan Gou, Renyuan Zhu, Xiuli Xie, Mengting Hu, Xing Wang, Tai'e Wang, Kaiwen Xu, Ying-Chun Xu. Journal: Scientific Data. Pub Date: 2026-01-23. DOI: 10.1038/s41597-026-06651-3
Although large language models (LLMs) demonstrate significant potential for advancing personalized science education, they face challenges in generating science problem-solving processes adapted to students' grade levels. In this paper, we developed a Chinese Science Question (CSQ) dataset, which comprises both a benchmark and a training set, to evaluate and enhance the science problem-solving capabilities of LLMs. The CSQ consists of 12,000 high-quality samples featuring a variety of question types and diverse discipline properties, covering four subjects and multiple topics at the Chinese primary school level. We further designed the language model to reflect these discipline properties in the generated responses, emulating the thought process of students when solving science questions. We demonstrated that CSQ and its extensive annotations can be employed for fine-tuning models. This was confirmed through both automatic and human evaluations, particularly in generating problem-solving processes that are aligned with students' grade levels.
Title: A Chinese Elementary Science Question Dataset in Problem-Solving Process Generation. Authors: Dong Li, Zhi Liu, Chaodong Wen, Jiaxin Guo, Taotao Long, Xian Peng. Journal: Scientific Data. Pub Date: 2026-01-23. DOI: 10.1038/s41597-026-06618-4
We present a comprehensive dataset of absorption and reduced scattering spectra collected via time-domain diffuse optical spectroscopy in the 610-1110 nm range, across 10 subjects and 5 body locations: the upper arm, the radius-ulna region, the abdomen, the forehead, and the calcaneus. Ultrasound images acquired at the same locations are included as well and, together with the demographic information, offer insight into inter-subject variability. The dataset, openly available on Zenodo, contains the raw data, the metadata, and the tools to operate on them. It can be used to devise light-based diagnostic or therapeutic techniques, to appreciate biological variability, and to test different models of photon migration.
Title: In-vivo optical properties spectra across five body locations on ten subjects using time-domain diffuse optics. Authors: Vamshi Damagatla, Siënna Karremans, Alessandro Bossi, Edoardo Martinenghi, Srirang Manohar, Paola Taroni, Rinaldo Cubeddu, Antonio Pifferi, Ilaria Bargigia. Journal: Scientific Data. Pub Date: 2026-01-22. DOI: 10.1038/s41597-026-06586-9
The paper describes a dataset for deeper evaluation of machine learning models for handwritten character recognition. To that end, we built a dataset that, combined with the existing NIST databases, enables additional analysis of models built on these data. The paper summarizes the most popular publicly available machine learning models trained on the EMNIST-letters dataset. We discuss issues with evaluating state-of-the-art results by comparing accuracy achieved on a test set built in a cross-validation setting. We propose additional evaluation on new, independently constructed data, unaffiliated with the NIST database authors. The dataset and source code have been made available through the Gdansk Tech University repository Most Wiedzy.
Title: The dataset for extending EMNIST evaluation. Authors: Julian Szymański, Kacper Skarżyński, Błażej Szutenberg, Klaudia Ratkowska, Szymon Drywa. Journal: Scientific Data, volume 13, issue 1, page 73. Pub Date: 2026-01-22. DOI: 10.1038/s41597-025-06291-z. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12827260/pdf/
Pub Date: 2026-01-22 DOI: 10.1038/s41597-026-06588-7
Pan Wang, Xinyue Wang, Denghua Yin, Jie Liu, Min Jiang, Kai Liu
Opsariichthys evolans is a stream-dwelling fish species endemic to China, with its primary distribution encompassing southeastern China and Taiwan. Initially classified under the genus Zacco, it was later reclassified into the genus Opsariichthys based primarily on mitochondrial DNA evidence. However, this taxonomic revision remains partially inconclusive due to the absence of whole-genome data. We therefore generated a telomere-to-telomere, gap-free genome assembly of O. evolans, consisting of 39 chromosomes with one contiguous sequence per chromosome. The assembly has a total size of 886.9 Mb and a contig N50 of 25.44 Mb. The presence of the telomere repeat was confirmed in the genome. BUSCO assessment indicated 99.34% genome completeness. Collinearity analysis revealed high synteny among O. evolans, O. bidens, and Zacco platypus. Genomic comparisons revealed key candidate genes and related biological pathways potentially responsible for color patterning and hydrodynamic adaptation. The complete O. evolans genome provides insights into its genome structure and function, and supports the taxonomic reclassification between the genera Opsariichthys and Zacco.
Title: Telomere-to-telomere gap-free genome assembly of the Opsariichthys evolans (Cypriniformes: Cyprinidae). Journal: Scientific Data.