Pub Date: 2026-04-01 | Epub Date: 2026-02-05 | DOI: 10.1016/j.dib.2026.112532
Sam D. Heraghty, Aijun Zhang, Daniel Kuhar, Dawn E. Gundersen-Rindal, Michael E. Sparks
The cotton seed bug, Oxycarenus hyalinipennis, is an agricultural pest that has recently been detected in the United States and has the potential to cause extensive economic damage to the cotton production industry. Currently, there are no transcriptomic resources for this species. The data reported here will help guide future efforts to create additional reference resources and facilitate the development of population control strategies. These data could also help identify protein-coding genes in a cotton seed bug genome assembly. A total of 13,384 differentially expressed genes were identified, which collectively encoded 40,871 distinct transcripts, of which 18,842 could be annotated with a reference protein in the NCBI NR database, 13,233 with Pfam protein families, and 8,089 with Gene Ontology (GO) terms. These transcripts could, for example, be targeted in future functional genomics work.
{"title":"A transcriptome sequence dataset characterizing eggs, nymphs and adults of Oxycarenus hyalinipennis, the cotton seed bug","authors":"Sam D. Heraghty, Aijun Zhang, Daniel Kuhar, Dawn E. Gundersen-Rindal, Michael E. Sparks","doi":"10.1016/j.dib.2026.112532","DOIUrl":"10.1016/j.dib.2026.112532","url":null,"PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"65 ","pages":"Article 112532"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146185141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-04-01 | Epub Date: 2026-02-06 | DOI: 10.1016/j.dib.2026.112547
Daniel Jones, Praful Aggarwal, Jamison Trewyn, Poojhaa Shanmugam, Kyle Leistikow, Troy Skwor
During screening of foodstuffs for ESBL-producing Aeromonas species on ampicillin dextrin agar with vancomycin and cefotaxime, a multidrug-resistant Empedobacter brevis strain, GBW-1, was identified from ground beef. Phylogenetic analysis supports the interconnectedness of environment, humans, and food driving this species' evolutionary development. Antimicrobial susceptibility testing demonstrated resistance to gentamicin, carbapenems, and third-generation cephalosporins. Whole genome sequencing of this strain yielded a 3.74 Mb genome with 32.8% GC content containing 3780 coding genes. Among these, at least three known antimicrobial resistance (AMR) genes were identified: qacG, a vanT gene within the vanG cluster, and a novel metallo-β-lactamase variant, blaEBR-6. This homologue, EBR-6, was compared against previously known EBR variants and was found to be closest to EBR-3, with 84.98% amino acid identity. In silico molecular docking experiments predicted that these mutations change binding to meropenem. Furthermore, nearly 100 annotated regions associated with mobile genetic elements, including tra operons, were identified on the genome. Together, this dataset provides genomic, phenotypic, and in silico data that may be reused to monitor the evolution of EBR from a One Health perspective.
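The two summary statistics quoted above, GC content and percent amino acid identity, can be sketched as follows. This is a minimal illustration on toy sequences, not the authors' pipeline: the function names are hypothetical, and the identity denominator (ungapped aligned columns) is an assumption, since the abstract does not say which convention was used.

```python
def gc_content(seq: str) -> float:
    """Return the GC percentage of a nucleotide sequence, counting only A/C/G/T."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A/C/G/T bases")
    gc = sum(base in "GC" for base in counted)
    return 100.0 * gc / len(counted)


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption: gap columns ('-') are excluded from the denominator.
    Other tools divide by alignment length or by the shorter sequence.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


if __name__ == "__main__":
    # Toy inputs only; these are not sequences from the GBW-1 dataset.
    print(gc_content("ATGC"))
    print(percent_identity("MKT-L", "MRTAL"))
```

Because the denominator choice shifts the result, a reported figure such as 84.98% is only comparable across studies when the same convention is applied.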
{"title":"Whole genome sequencing data analysis identified a cefotaxime-resistant Empedobacter brevis GBW-1 isolate from ground beef encoding a novel metallo-beta-lactamase variant, blaEBR-6","authors":"Daniel Jones , Praful Aggarwal , Jamison Trewyn , Poojhaa Shanmugam , Kyle Leistikow , Troy Skwor","doi":"10.1016/j.dib.2026.112547","DOIUrl":"10.1016/j.dib.2026.112547","url":null,"PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"65 ","pages":"Article 112547"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146185143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-04-01 | Epub Date: 2026-01-29 | DOI: 10.1016/j.dib.2026.112514
Paola Marques, Mariana Mendes, Thiago Emmanuel Pereira, Giovanni Farias
While public cloud providers dominate the commercial landscape, private clouds are widely adopted by academic and research institutions to meet specific governance and operational requirements. Multiple datasets on resource usage in public clouds are available; however, datasets capturing usage patterns in private clouds remain scarce, which limits research in this area. This work presents a dataset comprising over 64 million records collected from a private OpenStack-based cloud operated by the Distributed Systems Laboratory at the Federal University of Campina Grande, Brazil. Data were gathered continuously over nearly twelve months (May 23, 2024 to May 16, 2025) by querying OpenStack APIs and monitoring services every five minutes. The dataset captures different aspects of the infrastructure, allocation quotas, user-to-project associations (as OpenStack groups users into projects), server (virtual machine) specifications, and resource utilization for users and projects. Entries are timestamped, enabling temporal analyses of system dynamics. Sensitive attributes, such as user names, project names, IP addresses, and server names, were protected, leaving only system-generated UUIDs. By offering a detailed, time-stamped view of a private cloud, this dataset provides a valuable resource for cloud computing research, helping to bridge the gap in publicly available datasets from non-commercial cloud environments. The dataset is valuable not only for academic institutions but also for companies considering cloud repatriation.
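A quick sanity check on a five-minute polling schedule like the one described above can be sketched as follows. The date range and interval come from the abstract; everything else is an assumption made for illustration: the function names are hypothetical, every day in the range is treated as fully covered, and the 30-second tolerance is arbitrary.

```python
from datetime import date, datetime, timedelta


def expected_polls(start: date, end: date, interval_minutes: int = 5) -> int:
    """Polling cycles if every day from start to end (inclusive) is fully covered."""
    days = (end - start).days + 1
    return days * (24 * 60 // interval_minutes)


def find_gaps(timestamps, interval=timedelta(minutes=5), tolerance=timedelta(seconds=30)):
    """Return (previous, current) pairs whose spacing exceeds interval + tolerance."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > interval + tolerance]


if __name__ == "__main__":
    # Collection window quoted in the abstract.
    print(expected_polls(date(2024, 5, 23), date(2025, 5, 16)))

    # Toy timestamps with one missed stretch between the second and third sample.
    t0 = datetime(2024, 5, 23)
    stamps = [t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=20)]
    print(find_gaps(stamps))
```

Under the full-coverage assumption the window spans 359 days at 288 cycles per day, or 103,392 cycles, so 64 million records would correspond to roughly 619 records per cycle on average. Real collection gaps would lower the cycle count, which is what a check like `find_gaps` is meant to reveal.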
{"title":"Dataset on resource allocation and usage for a private cloud","authors":"Paola Marques, Mariana Mendes, Thiago Emmanuel Pereira, Giovanni Farias","doi":"10.1016/j.dib.2026.112514","DOIUrl":"10.1016/j.dib.2026.112514","url":null,"PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"65 ","pages":"Article 112514"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146185206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This article introduces Agri-Vision Bangladesh, a comprehensive, augmented image dataset designed to advance automated disease diagnosis in four economically vital agricultural crops: Bottle Gourd (Lagenaria siceraria), Zucchini (Cucurbita pepo), Papaya (Carica papaya), and Tomato (Solanum lycopersicum). Addressing the scarcity of region-specific agricultural data, a total of 5266 original images were acquired directly from diverse agricultural fields in Bangladesh using a SONY ALPHA 7 II full-frame camera under natural lighting conditions. The dataset encompasses 28 distinct classes, covering a wide spectrum of biotic stressors including viral (Mosaic Virus, Leaf Curl), fungal (Downy Mildew, Anthracnose, Alternaria Blight), bacterial (Bacterial Blight, Xanthomonas), and pest-induced damage (Insect Hole, White Spot), alongside Healthy samples. To ensure scientific reliability, each image underwent a rigorous two-stage validation process by senior agronomists. To tackle class imbalance and facilitate the training of data-intensive Deep Learning models, the dataset was expanded using a Python-based augmentation pipeline incorporating geometric transformations (rotation, flipping) and photometric adjustments (noise, brightness) resulting in a final repository of 28,000 images (5266 original and 22,734 augmented). All files are standardized to 512×512 pixels in JPG format. This expert-validated resource serves as a critical benchmark for developing robust computer vision algorithms (e.g., CNNs, Vision Transformers) for precision agriculture, enabling research into fine-grained classification, object detection, and cross-crop transfer learning in subtropical farming environments.
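The geometric and photometric augmentations named above can be sketched in plain Python. This is an illustration only: the abstract does not say which library or parameters the authors' pipeline used, so the function names, the grayscale nested-list image representation, and the parameter values are all assumptions.

```python
import random

Image = list[list[int]]  # grayscale rows with pixel values in 0..255 (assumed representation)


def _clamp(value: float) -> int:
    """Round and clip a pixel value to the valid 0..255 range."""
    return min(255, max(0, round(value)))


def flip_horizontal(img: Image) -> Image:
    """Geometric: mirror each row left to right."""
    return [row[::-1] for row in img]


def rotate_90_clockwise(img: Image) -> Image:
    """Geometric: rotate by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def adjust_brightness(img: Image, delta: int) -> Image:
    """Photometric: add a constant offset to every pixel."""
    return [[_clamp(p + delta) for p in row] for row in img]


def add_gaussian_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Photometric: add zero-mean Gaussian noise to every pixel."""
    return [[_clamp(p + rng.gauss(0, sigma)) for p in row] for row in img]


if __name__ == "__main__":
    tile = [[1, 2], [3, 4]]
    print(flip_horizontal(tile))
    print(rotate_90_clockwise(tile))
    print(adjust_brightness([[10, 250]], 10))
    print(add_gaussian_noise(tile, sigma=5.0, rng=random.Random(0)))
    # Dataset totals quoted in the abstract: original plus augmented images.
    print(5266 + 22734)
```

Passing a seeded `random.Random` keeps the noise step reproducible, which matters when augmented files are published alongside the originals.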
{"title":"Agri-vision Bangladesh: A multi-crop augmented image dataset for automated disease diagnosis in Bottle Gourd, Zucchini, Papaya, and Tomato","authors":"Md Masum Billah , Md. Anisur Rahman , Saifuddin Sagor , Sanzida Parvin , Mohammad Shorif Uddin","doi":"10.1016/j.dib.2026.112528","DOIUrl":"10.1016/j.dib.2026.112528","url":null,"PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"65 ","pages":"Article 112528"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146185207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-04-01 | Epub Date: 2026-01-13 | DOI: 10.1016/j.dib.2026.112468
Claudia Moricca, Rachele Nicolini, Lucrezia Masci, Lia Barelli, Simona Morretta, Raffaele Pugliese, Laura Sadori
The “Santi Quattro Coronati – archaeobotanical plates” dataset presents a comprehensive photographic collection of carpological remains recovered from a pit in the complex of Santi Quattro Coronati (Rome, Italy). The deposit, dated between the late 15th and the mid-16th century, yielded a diverse assemblage of desiccated plant remains. The dataset is novel in that it provides the complete photographic documentation of all identified taxa from a single Early Modern archaeological context, a chronological phase that remains underrepresented in Italian archaeobotanical research.
The photographic documentation focuses on a representative sample of each taxon identified in the archaeobotanical analysis, with particular attention to the best-preserved specimens. When multiple plant parts of the same taxon were present, all were included. The dataset also includes fragile and rarely illustrated plant parts, such as cereal rachis fragments, tunics and basal plates of onion and garlic, grapevine tendrils and legume seed coats. These are often excluded from reference atlases due to their low archaeological survivability and the consequent scarcity of well-preserved comparative specimens.
High-resolution images were acquired using a Leica MC205C stereomicroscope equipped with a Leica IC80HD camera and the Leica Application Suite v.4.5.0 software. Illumination was provided by the Leica LED5000 HDI™ dome system, ensuring constant, diffuse light conditions. A column of images was captured for each specimen and processed with Helicon Focus v.7.0.1 Pro through focus stacking to obtain a single fully focused image. Depending on specimen size and complexity, between 9 and 127 photographs were used per perspective. Larger samples, unsuitable for microscopic observation, were photographed using a Canon digital camera under controlled illumination. Post-processing was performed with GIMP, applying standard tools for background cleaning and masking. Each final plate includes a scale bar for size reference.
The dataset is organized alphabetically by plant family and taxon. For each taxon, one or more plates are provided, displaying specimens from one to three perspectives to represent their 3D morphology. Nomenclature follows the taxonomy used in the original publication of the assemblage and has been updated according to the most recent checklist of the Italian vascular flora. A metadata .xls file is provided to facilitate consultation, reuse, comparison and integration with other archaeobotanical datasets.
This dataset offers a well-documented comparative visual reference for species/genus identification and for assessing the preservation state and morphological integrity of desiccated archaeobotanical remains. Offering detailed photographic records of New World plant taxa previously identified in this context, the study enhances accessibility and understanding of these materials through visual reference. Despite being limited to a single context, the dataset represents best practice in archaeobotany, encouraging other researchers to share complete photographic documentation of the carpological assemblages they study, thereby supporting open science and the progressive construction of extended visual reference collections. The dataset is primarily intended for archaeobotanists and environmental archaeologists working on Early Modern contexts, but it can also serve researchers studying desiccated plant remains from other periods and locations.
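The idea behind focus stacking, as mentioned above, can be sketched with a simple per-pixel rule: for each position, keep the value from whichever frame is locally sharpest. This is not Helicon Focus's algorithm, which the abstract does not describe; the Laplacian-based sharpness measure, the nested-list grayscale frames, and the function names are all assumptions chosen for a short, dependency-free illustration.

```python
Frame = list[list[int]]  # grayscale rows; all frames are assumed to share one size and be aligned


def sharpness(img: Frame, y: int, x: int) -> int:
    """Absolute 4-neighbour Laplacian at (y, x), clamping coordinates at the borders."""
    h, w = len(img), len(img[0])

    def px(yy: int, xx: int) -> int:
        return img[min(max(yy, 0), h - 1)][min(max(xx, 0), w - 1)]

    return abs(4 * px(y, x) - px(y - 1, x) - px(y + 1, x) - px(y, x - 1) - px(y, x + 1))


def focus_stack(frames: list[Frame]) -> Frame:
    """Build one image by taking each pixel from the frame with the highest local sharpness."""
    if not frames:
        raise ValueError("at least one frame is required")
    h, w = len(frames[0]), len(frames[0][0])
    result = []
    for y in range(h):
        row = []
        for x in range(w):
            best = max(frames, key=lambda f: sharpness(f, y, x))
            row.append(best[y][x])
        result.append(row)
    return result


if __name__ == "__main__":
    # Toy stack: frame_a has strong contrast on the left, frame_b on the right.
    frame_a = [[0, 200, 50, 50]]
    frame_b = [[50, 50, 0, 200]]
    print(focus_stack([frame_a, frame_b]))
```

Production tools typically also align the frames and blend across neighbourhoods to avoid seams; this sketch omits both steps.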
{"title":"Visualizing archaeobotanical data: A comprehensive photographic record of desiccated plant remains from an early modern context at Santi Quattro Coronati, Rome","authors":"Claudia Moricca , Rachele Nicolini , Lucrezia Masci , Lia Barelli , Simona Morretta , Raffaele Pugliese , Laura Sadori","doi":"10.1016/j.dib.2026.112468","DOIUrl":"10.1016/j.dib.2026.112468","url":null,"PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"65 ","pages":"Article 112468"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146036501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sika deer (Cervus nippon) is naturally distributed across East Asia and includes 14 subspecies, showing phenotypic and genetic diversity. In this study, we constructed a de novo genome assembly of wild sika deer using one of the largest subspecies, C. n. yesoensis. We used HiFi reads, high-quality long reads from Pacific Biosciences sequencing, to build our novel genome assembly, CerNipYes1.0. The genome size of CerNipYes1.0 is estimated to be 3.1 Gb, which is 0.6 Gb larger than the previously reported sika deer genome assembly. The assembly comprises 1,810 scaffolds, with an N50 length of 77 Mb. Compleasm, a genome completeness evaluation tool based on Benchmarking Universal Single-Copy Orthologs (BUSCO), indicated that 12,562 genes (99.75%) were complete relative to the reference database. Our results indicate that CerNipYes1.0 is a valuable resource for studying the molecular biology, phylogeny, and evolution of the Cervidae and its genome.
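The N50 statistic quoted above has a simple definition: sort the scaffold lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. A minimal sketch follows, using toy lengths rather than the CerNipYes1.0 scaffolds; the function name is hypothetical.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running sum to avoid fractional halves for odd totals.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the running sum always reaches the total")


if __name__ == "__main__":
    # Toy scaffold lengths, not taken from the assembly described above.
    print(n50([80, 70, 50, 40, 30, 20]))
```

A high N50 alongside a modest scaffold count, as reported for this assembly, indicates that most of the sequence sits in a small number of long scaffolds.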
{"title":"A reference-grade genome assembly data of sika deer in Hokkaido, Japan, Cervus nippon yesoensis","authors":"Yuki Matsumoto , Junco Nagata , Yukiko Matsuura , Hayato Iijima","doi":"10.1016/j.dib.2025.112423","DOIUrl":"10.1016/j.dib.2025.112423","url":null,"PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"65 ","pages":"Article 112423"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146075244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-04-01 | Epub Date: 2026-01-21 | DOI: 10.1016/j.dib.2026.112490
Abu Kowshir Bitto, Md. Zahid Hasan, Md. Hasan Imam Bijoy, Khalid Been Badruzzaman Biplob, Mohammad Mahadi Hassan, Mohammad Shohel Rana, Abdul Kadar Muhammad Masum
Brinjal (Solanum melongena), or eggplant, is one of the four most essential vegetable crops grown in Bangladesh and contributes significantly to the country's agricultural industry. Brinjal supports the livelihood of numerous small farmers; however, it is highly susceptible to various fruit diseases, which seriously affect yield quality and may cause considerable economic losses. While most existing plant disease datasets focus primarily on leaf-related disorders, only a limited number include fruit-related diseases, and even those contain very few classes. This gap is significant because fruit diseases directly affect crop quality, market value, and overall yield. We therefore present a new, comprehensive dataset dedicated exclusively to brinjal fruit diseases. The dataset consists of 1823 high-quality, labelled images across five distinct classes: Phomopsis Blight, Shoot and Fruit Borer, Fruit Cracking, Wet Rot, and Healthy Fruit. The images were collected under real farm conditions in numerous areas of Bangladesh to ensure a robust sample of the varied environmental conditions and farming practices that affect disease development. The dataset is designed to support plant disease research and to enhance the training of deep learning models for automated disease detection. Lastly, the dataset will support early disease detection, improved crop management practices, reduced losses, and increased economic returns for farmers. Its release will encourage agricultural research as well as practical use in precision agriculture.
{"title":"BrinjalFruitX: A field-collected image dataset for machine learning and deep learning-based disease identification in brinjal fruits","authors":"Abu Kowshir Bitto , Md. Zahid Hasan , Md. Hasan Imam Bijoy , Khalid Been Badruzzaman Biplob , Mohammad Mahadi Hassan , Mohammad Shohel Rana , Abdul Kadar Muhammad Masum","doi":"10.1016/j.dib.2026.112490","DOIUrl":"10.1016/j.dib.2026.112490","url":null,"PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"65 ","pages":"Article 112490"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146185076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-04-01 | Epub Date: 2026-01-28 | DOI: 10.1016/j.dib.2026.112512
Mehedi Hasan
Educational data mining and learning analytics have become important research areas for supporting pedagogical analysis, algorithm development, and privacy-preserving educational research. The advancement of natural language processing (NLP) methods in educational contexts depends on the availability of structured and well-documented textual datasets; however, access to real student data is often restricted due to ethical, legal, and privacy concerns. This article presents a fully synthetic textual dataset of student learning habits and preferences generated using a large language model (LLM). The dataset contains 10,000 CSV-formatted records representing fictional students and includes attributes such as education level, study hours, preferred learning methods, learning challenges, motivation levels, opinions on online learning, and primary devices used for study. Data generation was performed using structured prompting strategies with explicitly defined controlled vocabularies to ensure internal consistency and reproducibility while avoiding the use of any real personal information. The resulting dataset follows intentionally controlled and near-uniform distributions, with variables generated under independent constraints. This design limits its suitability for modelling real-world stochastic behaviour or discovering natural correlations but makes it appropriate for benchmarking educational NLP pipelines, evaluating synthetic data generation techniques, and conducting privacy-preserving survey and machine learning experiments.
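The effect of controlled vocabularies and independent constraints described above can be sketched with a small generator. This is explicitly not the article's method: the authors prompted an LLM, whereas this sketch swaps in a seeded random sampler to show why such a design yields near-uniform, uncorrelated columns. The field names and vocabulary values are hypothetical.

```python
import csv
import io
import random

# Hypothetical controlled vocabularies; the article's actual value lists are not given here.
VOCAB = {
    "education_level": ["High School", "Undergraduate", "Postgraduate"],
    "preferred_learning_method": ["Video", "Reading", "Practice"],
    "motivation_level": ["Low", "Medium", "High"],
    "primary_device": ["Laptop", "Smartphone", "Tablet"],
}


def generate_records(n: int, seed: int = 0) -> list[dict]:
    """Draw each field independently and uniformly from its vocabulary."""
    rng = random.Random(seed)
    return [
        {"student_id": i + 1, **{field: rng.choice(values) for field, values in VOCAB.items()}}
        for i in range(n)
    ]


def validate(records: list[dict]) -> bool:
    """Check that every field value belongs to its controlled vocabulary."""
    return all(rec[field] in allowed for rec in records for field, allowed in VOCAB.items())


def to_csv(records: list[dict]) -> str:
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["student_id", *VOCAB])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    records = generate_records(5, seed=1)
    assert validate(records)
    print(to_csv(records))
```

Because every field is sampled independently, any correlation found between columns is sampling noise, which mirrors the abstract's caveat that such data suit pipeline benchmarking rather than the discovery of natural relationships.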
{"title":"A fully synthetic textual dataset of student learning habits and preferences generated using a large language model","authors":"Mehedi Hasan","doi":"10.1016/j.dib.2026.112512","DOIUrl":"10.1016/j.dib.2026.112512","url":null,"abstract":"<div><div>Educational data mining and learning analytics have become important research areas for supporting pedagogical analysis, algorithm development, and privacy-preserving educational research. The advancement of natural language processing (NLP) methods in educational contexts depends on the availability of structured and well-documented textual datasets; however, access to real student data is often restricted due to ethical, legal, and privacy concerns. This article presents a fully synthetic textual dataset of student learning habits and preferences generated using a large language model (LLM). The dataset contains 10,000 CSV-formatted records representing fictional students and includes attributes such as education level, study hours, preferred learning methods, learning challenges, motivation levels, opinions on online learning, and primary devices used for study. Data generation was performed using structured prompting strategies with explicitly defined controlled vocabularies to ensure internal consistency and reproducibility while avoiding the use of any real personal information. The resulting dataset follows intentionally controlled and near-uniform distributions, with variables generated under independent constraints. This design limits its suitability for modelling real-world stochastic behaviour or discovering natural correlations but makes it appropriate for benchmarking educational NLP pipelines, evaluating synthetic data generation techniques, and conducting privacy-preserving survey and machine learning experiments.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"65 ","pages":"Article 112512"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146185080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-04-01 Epub Date: 2026-01-28 DOI: 10.1016/j.dib.2026.112500
Robin J. Pakeman
Numerous approaches have been used to assess how species respond to a changing climate. One of the simplest is to calculate indices that describe the climate of the areas occupied by different species and to use them to assess community-level change or whether species’ trends are predictable from the climate of their ranges. The paper describes the calculation of Species Climate Indices for 4924 UK invertebrate species from freshwater and terrestrial ecosystems by combining information from occurrence records and historical climate data. The indices calculated are the mean January temperature, mean July temperature and mean annual precipitation of the 10 km × 10 km squares occupied by each species during the period used for calculating the climate data (1991–2020). These data have been used to assess whether trends in occupancy are correlated with species’ climate indices [1], and they are also well suited to examining trends within communities where repeat sampling has been carried out.
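The index calculation described above can be sketched as a per-species average of climate values over occupied grid squares. This is a minimal sketch under stated assumptions, not the author's procedure: the grid-square identifiers, species names and climate values are invented for illustration, and the abstract does not say whether squares are weighted by record count, so each occupied square is counted once here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical climate table keyed by 10 km grid square:
# (mean January temp in deg C, mean July temp in deg C, mean annual precipitation in mm).
climate = {
    "SK00": (4.0, 16.0, 800.0),
    "SK01": (3.0, 15.0, 1000.0),
    "NN00": (1.0, 13.0, 1800.0),
}

# Hypothetical occurrence records as (species, grid square) pairs.
records = [
    ("species_a", "SK00"),
    ("species_a", "SK01"),
    ("species_a", "SK00"),  # repeat record for an already occupied square
    ("species_b", "NN00"),
    ("species_b", "SK01"),
]


def species_climate_indices(records, climate):
    """Average each climate variable over the distinct squares a species occupies."""
    occupied = defaultdict(set)
    for species, square in records:
        if square in climate:  # ignore squares that have no climate data
            occupied[species].add(square)

    indices = {}
    for species, squares in occupied.items():
        jan, jul, precip = zip(*(climate[s] for s in squares))
        indices[species] = {
            "jan_temp": mean(jan),
            "jul_temp": mean(jul),
            "annual_precip": mean(precip),
        }
    return indices


indices = species_climate_indices(records, climate)
for species, values in sorted(indices.items()):
    print(species, values)
```

Using a set of squares per species means repeated records from a heavily sampled square do not pull the index towards that square's climate. Weighting by record count would be a different design choice.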
{"title":"Species climate index data for United Kingdom invertebrates","authors":"Robin J. Pakeman","doi":"10.1016/j.dib.2026.112500","DOIUrl":"10.1016/j.dib.2026.112500","url":null,"abstract":"<div><div>Numerous approaches have been used to assess the response of species to changing climate. One of the simplest is the calculation of indices which describe the climate of areas occupied by different species and uses them to assess community level change or to assess if species’ trends are predictable from the climate of their ranges. The paper describes the calculation of Species Climate Indices for 4924 UK invertebrate species from freshwater and terrestrial ecosystem by combining information from occurrence records and historical climate data. The indices calculated are the mean January temperature, mean July temperature and mean annual precipitation of 10 km x 10 km squares occupied by the species during the period used for calculating the climate data (1991–2020). These data have been used to assess if trends in occupancy are correlated to species’ climate indices [<span><span>1</span></span>] but are also ideally used for looking at trends within communities if repeat sampling has been carried out.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"65 ","pages":"Article 112500"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146185138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-04-01 Epub Date: 2026-01-15 DOI: 10.1016/j.dib.2026.112469
Amged O. Abdelatif, Abdelrahim H. Abdelrahim, Gamar-Aldwla S. Shangray, Mohammed-Alfatih Mustafa, Mustafa M. Abaker, Yahia A. Idris, Abdelrahim M. Yousif
This data article describes a comprehensive dataset comprising 12,161 individual steel reinforcement bar tensile tests (from 3,898 test reports) collected from various construction projects across Sudan between 2016 and 2022. The data were systematically extracted from official test reports generated by the University of Khartoum, Faculty of Engineering, Department of Civil Engineering, Material and Structures Testing Laboratory. The purpose of this dataset is to establish a verified, large-scale baseline of material performance for Sudanese reinforcement steel, providing transparent and verifiable raw values of key mechanical and dimensional properties for locally sourced rebars with tested diameters ranging from 8 mm to 32 mm. The data are intended for reuse in rigorous analyses of steel reinforcement quality and characteristic properties in Sudan, offering a unique baseline for regional construction quality and providing a representative performance benchmark applicable to other developing countries.
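As an illustration of the kind of characteristic-property analysis mentioned above, the sketch below estimates a characteristic yield strength as the lower 5 % fractile of a normal distribution fitted to test results (mean minus about 1.645 standard deviations). The abstract does not state which statistical definition the authors have in mind, so the normal-distribution assumption, the 5 % fractile and the sample values are all assumptions for this example. Design codes may prescribe sample-size-dependent factors instead.

```python
from statistics import NormalDist, mean, stdev


def characteristic_value(values, fractile=0.05):
    """Lower fractile of a normal distribution fitted to the sample.

    Assumes the test results are approximately normally distributed.
    """
    z = NormalDist().inv_cdf(fractile)  # about -1.645 for the 5 % fractile
    return mean(values) + z * stdev(values)


# Hypothetical yield strengths in MPa; not values from the dataset.
yield_strengths_mpa = [520.0, 535.0, 548.0, 510.0, 560.0, 541.0]

fk = characteristic_value(yield_strengths_mpa)
print(f"Mean yield strength: {mean(yield_strengths_mpa):.1f} MPa")
print(f"Characteristic yield strength (5 % fractile): {fk:.1f} MPa")
```

With thousands of tests per bar diameter, an empirical 5th percentile computed directly from the data would be a reasonable alternative that avoids the normality assumption.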
{"title":"Dataset of 12,161 steel rebar tests from sudanese construction projects (2016-2022)","authors":"Amged O. Abdelatif, Abdelrahim H. Abdelrahim, Gamar-Aldwla S. Shangray, Mohammed-Alfatih Mustafa, Mustafa M. Abaker, Yahia A. Idris, Abdelrahim M. Yousif","doi":"10.1016/j.dib.2026.112469","DOIUrl":"10.1016/j.dib.2026.112469","url":null,"abstract":"<div><div>This data article describes a comprehensive dataset comprising 12,161 individual steel reinforcement bar tensile tests (3,898 test reports) collected from various construction projects across Sudan between 2016 and 2022. The data was systematically extracted from official test reports generated by the University of Khartoum, Faculty of Engineering, Department of Civil Engineering, Material and Structures Testing Laboratory. The purpose of this dataset is to establish a verified, large-scale baseline of material performance for Sudanese reinforcement steel, providing transparent and verifiable raw values of key mechanical and dimensional properties for locally sourced rebars with tested diameters ranging from 8 mm to 32 mm. This data is intended for reuse to conduct rigorous analyses on steel reinforcement quality and characteristic properties in Sudan, offering a unique baseline for regional construction quality and providing a representative performance benchmark applicable to other developing countries.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"65 ","pages":"Article 112469"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146075167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}