Data in Brief: latest articles

An open image dataset of Indonesian soybean seed varieties (Anjasmoro, Grobogan, DEGA-1) for agricultural research and machine learning applications
IF 1.4 Q3 MULTIDISCIPLINARY SCIENCES Pub Date : 2026-02-03 DOI: 10.1016/j.dib.2026.112524
Diana Sofia Hanafiah, Rahmatika Alfi, Anggria Lestami, Fanindia Purnamasari, Rossy Nurhasanah, Muhammad Ariyo Syahraza, Muhammad Azis Saputra, Usman Ismail Pane, Steven Manurung, Keisya, Yunus Tio Buntoro, Josua Peter Corda, Gali Rakasiwi
Soybean (Glycine max L.) holds an important position as a main source of protein in Indonesia. Its quality and productivity can be assessed based on the characteristics of its seed. Accordingly, identification through the observation of soybean seed traits is a crucial step in plant breeding and quality assurance. Conventional approaches rely on manual observation, which is subjective, prone to human error, and time-consuming. With advances in artificial intelligence, automated seed identification has emerged as a potential solution. However, progress is constrained by the lack of open and standardized image datasets, especially for locally bred varieties in developing countries. To address this gap, we present an open image dataset of Indonesian soybean seeds from three widely cultivated varieties developed through plant breeding: Anjasmoro, Grobogan, and DEGA-1. The dataset consists of high-resolution seed images captured with an Epson L360 flatbed scanner, with the optical resolution fixed at 800 dots per inch, yielding images of 6800 × 9359 pixels. All raw images are saved in JPG format. No manually annotated segmentation masks are released in this version; instead, DeepLab V3+ with a MobileNet backbone is used to enable automated seed image segmentation. The curated dataset is intended to support a broad range of applications, including computer vision tasks such as image classification and segmentation, as well as research in plant breeding, seed quality assessment, and agricultural informatics. By providing a standardized and publicly accessible resource, this dataset contributes to the advancement of interdisciplinary studies at the intersection of agriculture and artificial intelligence.
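As a quick consistency check on the reported acquisition settings, the pixel dimensions can be converted back to a physical scan area. The sketch below is illustrative arithmetic only: the function name is hypothetical, and the reading that the result corresponds to a full flatbed scan is an inference, not a statement from the abstract.

```python
# Illustrative arithmetic; scan_area_mm is a hypothetical helper, not part of the dataset.
MM_PER_INCH = 25.4


def scan_area_mm(width_px: int, height_px: int, dpi: int) -> tuple[float, float]:
    """Convert pixel dimensions to physical size in millimetres at a given dpi."""
    return (width_px / dpi * MM_PER_INCH, height_px / dpi * MM_PER_INCH)


# Values reported in the abstract: 6800 x 9359 pixels at 800 dpi.
width_mm, height_mm = scan_area_mm(6800, 9359, 800)
print(f"{width_mm:.1f} mm x {height_mm:.1f} mm")  # prints "215.9 mm x 297.1 mm"
```

At 800 dpi each pixel covers about 0.032 mm, so the stated image size corresponds to roughly 216 mm × 297 mm.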
Data in Brief, vol. 65, Article 112524.
Citations: 0
Dataset of scattered images using noncoherent light under varying diffusion conditions and projected patterns
IF 1.4 Q3 MULTIDISCIPLINARY SCIENCES Pub Date : 2026-02-02 DOI: 10.1016/j.dib.2026.112541
Roger Chiu-Coutino, Miguel S. Soriano-Garcia, Carlos Israel Medel-Ruiz, S.M. Afanador-Delgado, Edgar Villafaña-Rauda, Roger Chiu
This data article presents an experimental dataset of scattered images obtained using a low-cost, open-source, Raspberry Pi-based optical system. Each data sample includes two grayscale images of 256 × 256 resolution: (i) the scattered image and (ii) the original projected pattern, which serves as ground truth. The system projects diverse patterns through various optical diffusers with different scattering coefficients and physical thicknesses. The dataset includes geometric shapes, digits, and textures to increase variability and support generalization. This variety allows the analysis of distinct scattering regimes and the evaluation of image recovery models under varying optical complexities. The dataset supports deep learning research focused on inverse problems in optics. It is particularly useful for training and benchmarking image restoration models in scattering environments.
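Because each sample pairs a scattered image with its ground-truth pattern, a restoration model can be scored by comparing its output against the ground truth. The sketch below uses peak signal-to-noise ratio (PSNR) as one possible metric; the choice of PSNR, the function name, and the synthetic stand-in images are all assumptions for illustration, since the abstract does not name an evaluation metric or a file layout.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """PSNR in dB between two equally sized grayscale images given as lists of rows."""
    if len(reference) != len(estimate) or any(
        len(r) != len(e) for r, e in zip(reference, estimate)
    ):
        raise ValueError("images must have the same shape")
    n_pixels = sum(len(row) for row in reference)
    mse = sum(
        (r - e) ** 2
        for ref_row, est_row in zip(reference, estimate)
        for r, e in zip(ref_row, est_row)
    ) / n_pixels
    # Identical images have zero error, so PSNR is unbounded.
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


# Synthetic 256 x 256 stand-ins (NOT dataset samples): a gradient pattern and
# a hypothetical "restored" version that is uniformly 5 grey levels too bright.
size = 256
ground_truth = [[(x + y) % 200 for x in range(size)] for y in range(size)]
restored = [[value + 5 for value in row] for row in ground_truth]

score_db = psnr(ground_truth, restored)
print(f"PSNR: {score_db:.2f} dB")
```

In practice the two lists would be replaced by a model output and the matching ground-truth image loaded from the dataset.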
Data in Brief, vol. 65, Article 112541.
Citations: 0
Europe-wide maps of biomass density based on satellite remote sensing data for 2017, 2020, 2021 and 2023
IF 1.4 Q3 MULTIDISCIPLINARY SCIENCES Pub Date : 2026-02-02 DOI: 10.1016/j.dib.2026.112536
Maurizio Santoro, Oliver Cartus, Arnan Araza, Martin Herold, Jukka Miettinen, Ake Rosenqvist, Kazufumi Kobayashi, Takeo Tadono, Frank Martin Seifert
Spatially explicit information on forest structure and biomass is needed to meet the monitoring and reporting requirements of several European policies. Satellite images enable mapping and monitoring of Europe's forest resources through operational observations from the Sentinel-1 Synthetic Aperture Radar (SAR) and the Advanced Land Observing Satellite 2 (ALOS-2) Phased Array L-band SAR 2 (PALSAR-2) instruments. Data acquired in 2017, 2020, 2021 and 2023 were used to generate annual maps of forest biomass variables, namely Growing Stock Volume (GSV), Aboveground Biomass (AGB) and Belowground Biomass (BGB), with a pixel size of 20 m × 20 m. All products are in the geometry of the Sentinel-2 tiling system. A spatially averaged map with a pixel size of 100 m × 100 m (1 hectare) in geographic projection is also supplied for users who do not require the highest spatial resolution. The maps were generated with a fully documented processing chain that includes (i) pre-processing of the SAR data to create stacks of co-registered, terrain-geocoded images of the backscattered intensity and (ii) inversion of a physically based model to estimate GSV. AGB and BGB were subsequently estimated using allometric relationships. Per-pixel standard deviations were computed for each biomass variable by propagating uncertainties from both the SAR observations and the model parameters. The maps clearly reproduce the expected spatial patterns of forest biomass across Europe and provide sufficient spatial detail to identify biomass dynamics related to, e.g., logging and regrowth. Validation against measurements collected by National Forest Inventories (NFIs) indicates poor agreement with map values at the pixel scale, with errors larger than 50% of the reference biomass. The correspondence improved substantially for spatial aggregates, such as administrative units, for which the bias was mostly negligible and the mean square error was below 30% of the reference value. The number of ALOS-2 PALSAR-2 images affected the inter-annual consistency of the maps, which was lower in regions with only one or two observations per year.
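A 100 m × 100 m cell covers a 5 × 5 block of 20 m × 20 m pixels, so one simple way to picture the spatially averaged product is block averaging. The sketch below is a simplified illustration under that assumption: the function name is hypothetical, the grid is synthetic, and it ignores the reprojection to geographic coordinates mentioned in the abstract, so it should not be read as the authors' processing chain.

```python
def block_mean(grid, factor=5, nodata=None):
    """Average non-overlapping factor x factor blocks of a 2-D grid (list of rows).

    Cells equal to `nodata` are skipped; a block with no valid cells yields `nodata`.
    Trailing rows or columns that do not fill a complete block are dropped.
    """
    rows, cols = len(grid), len(grid[0])
    out = []
    for r0 in range(0, rows - rows % factor, factor):
        out_row = []
        for c0 in range(0, cols - cols % factor, factor):
            values = [
                grid[r][c]
                for r in range(r0, r0 + factor)
                for c in range(c0, c0 + factor)
                if grid[r][c] != nodata
            ]
            out_row.append(sum(values) / len(values) if values else nodata)
        out.append(out_row)
    return out


# Synthetic 10 x 10 grid of 20 m pixels (hypothetical values), i.e. a 2 x 2 grid at 100 m.
fine = [[float(r // 5 * 2 + c // 5) for c in range(10)] for r in range(10)]
coarse = block_mean(fine, factor=5)
print(coarse)  # prints [[0.0, 1.0], [2.0, 3.0]]
```

Averaging over 25 pixels also suggests why agreement with reference data tends to improve for spatial aggregates: independent per-pixel errors partly cancel.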
Data in Brief, vol. 65, Article 112536.
Citations: 0
BigFlow-NIDS: A large-scale dataset for network intrusion detection in big data environment
IF 1.4 Q3 MULTIDISCIPLINARY SCIENCES Pub Date : 2026-02-02 DOI: 10.1016/j.dib.2026.112530
Mohammed Borhan Uddin, Mohammad Shamsul Arefin, M.M. Musharaf Hussain
We present BigFlow-NIDS, a large-scale, NetFlow-based dataset for network intrusion detection research in big-data environments. BigFlow-NIDS contains 66,935,021 flows, 55 flow attributes, and 32 fine-grained attack categories, and is available in both CSV and Parquet formats to support scalable ML and streaming analyses. Under the paper's Colab setup, loading the Parquet version was about 34 times faster than loading the CSV version (CSV: 920.82 s vs Parquet: 27.35 s), demonstrating the importance of columnar storage for large NIDS corpora. The dataset contains 36.6 million benign flows and 30.3 million attack flows (roughly 55% benign and 45% attack), indicating a moderate class imbalance. We release BigFlow-NIDS and provide baseline exploratory analyses and anomaly-detection experiments to support the development and evaluation of scalable, temporally-aware intrusion detection systems.
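The figures quoted in the abstract can be turned into two derived quantities that are easier to compare: the Parquet-over-CSV load speedup and the benign/attack class shares. The sketch below is plain arithmetic on the reported numbers; the variable names are illustrative, and the flow counts are the rounded values from the abstract rather than exact per-class totals.

```python
# Load times reported in the abstract (seconds, under the paper's Colab setup).
csv_seconds = 920.82
parquet_seconds = 27.35
speedup = csv_seconds / parquet_seconds

# Rounded class counts reported in the abstract.
benign_flows = 36.6e6
attack_flows = 30.3e6
total_flows = benign_flows + attack_flows
benign_share = benign_flows / total_flows
attack_share = attack_flows / total_flows

print(f"Parquet speedup: {speedup:.1f}x")
print(f"Benign share: {benign_share:.1%}, attack share: {attack_share:.1%}")
# The rounded class counts sum to 66.9 million, consistent with the
# reported total of 66,935,021 flows.
print(f"Rounded total: {total_flows / 1e6:.1f} million flows")
```

A split near 55/45 is mild at the binary level; imbalance across the 32 attack categories may be considerably stronger, but the abstract gives no per-category counts.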
Data in Brief, vol. 65, Article 112530.
Citations: 0
MedQA-MA: A Moroccan Arabic medical question-answering dataset for virtual healthcare assistants and large language models
IF 1.4 Q3 MULTIDISCIPLINARY SCIENCES Pub Date : 2026-02-02 DOI: 10.1016/j.dib.2026.112537
Soufiyan Ouali, Said El Garouani
The healthcare domain constitutes a fundamental pillar of national development, as maintaining population health not only enhances citizens' quality of life but also generates substantial economic benefits through increased productivity, innovation, and workforce participation. However, the healthcare industry faces numerous challenges and barriers that impede universal access to medical services. In low- and middle-income countries, significant portions of the population forgo medical consultations due to various socioeconomic constraints, including prohibitive consultation fees, scheduling difficulties, and extended waiting periods. Consequently, there is an urgent need for innovative approaches to optimize healthcare delivery processes. Recent advances in artificial intelligence have demonstrated promising potential in developing intelligent systems that address healthcare accessibility gaps. These innovations include medical chatbots, appointment booking systems, disease-prediction models, and psychiatric virtual assistants. However, such technological enhancements have predominantly focused on high-resource languages, while research in low-resource languages, particularly Arabic, remains in its preliminary stages. This disparity is especially pronounced in Arabic dialects, which differ substantially from Modern Standard Arabic in terms of vocabulary, syntax, and semantic structures. To address this critical gap, we present the first comprehensive dataset for the Moroccan Arabic dialect in the healthcare domain. The MedQA-MA dataset comprises 108,943 question-answer pairs in text format, with each pair categorized according to medical specialty. Covering 23 distinct medical specialties, the dataset serves multiple applications, including sentiment analysis, specialty classification, question-answering systems, and the development of human-like medical chatbots. The dataset has been meticulously curated, annotated, and validated by qualified medical professionals, ensuring its reliability and clinical relevance for developing realistic healthcare systems grounded in authentic medical interactions.
The MedQA-MA dataset is publicly available and freely accessible at https://data.mendeley.com/datasets/v6gs7nsy9z/1, representing a significant contribution to Arabic Natural Language Processing research in healthcare applications and facilitating the development of culturally and linguistically appropriate medical AI systems for Arabic-speaking populations.
Data in Brief, vol. 65, Article 112537.
Citations: 0
A comprehensive multimodal MRI and EEG-TMS dataset on the impact of parietal cortex inhibition on decision-making under ambiguity
IF 1.4 Q3 MULTIDISCIPLINARY SCIENCES Pub Date : 2026-02-02 DOI: 10.1016/j.dib.2026.112535
Alejandra Figueroa-Vargas, Gabriela Valdebenito-Oyarzo, María Paz Martínez-Molina, Francisco Zamorano, Pablo Billeke
In daily life, we often face decisions where potential outcomes are unclear, creating uncertainty. The complete or partial lack of knowledge regarding outcome probabilities—referred to as ambiguity—poses significant challenges for individuals. While recent studies have linked ambiguity in decision-making to neural activity in the parietal cortex, the precise role of this region and its interactions with other brain areas remain poorly understood.
Here, we present a comprehensive dataset on human decision-making under conditions of risk and ambiguity. The dataset includes two experimental sessions. The first one corresponds to the MRI setting, which includes structural MRI (T1- and T2-weighted images, n = 52), diffusion-weighted imaging (n = 45), and task-based functional MRI (n = 38). The second session corresponds to the EEG setting combined with inhibitory transcranial magnetic stimulation (TMS), targeting two parietal regions and the vertex (n = 24). TMS targets were defined from group-level fMRI activations obtained in the first session and then transformed to individual anatomy. Ten participants completed both fMRI and EEG-TMS recordings.
This dataset, partially analyzed in previous work, now includes newly acquired and previously unexamined data, such as diffusion-weighted imaging and T2-weighted images, and is fully organized according to the Brain Imaging Data Structure (BIDS) standard. It provides valuable opportunities to investigate the neurobiological mechanisms of decision-making under ambiguity, with a focus on the parietal cortex.
Data in Brief, vol. 65, Article 112535.
Citations: 0
Dataset on ecological health and microbial communities of coastal aquaculture ponds from surrounding region of Sundarban mangroves
IF 1.4 Q3 MULTIDISCIPLINARY SCIENCES Pub Date : 2026-01-31 DOI: 10.1016/j.dib.2026.112542
Yash, Anwesha Ghosh, Ajanta Dey, Milon Sinha, Nimai Bera, Sabyasachi Chakraborty, Punyasloke Bhadury
Integrated Mangrove Aquaculture (IMA) and Sustainable Aquaculture in Mangrove Ecosystem Fisheries (SAIME) are key activities undertaken across coastal regions globally to meet growing demand for brackish-water aquaculture products through sustainable practices. An in-depth biomonitoring study was conducted to map the ecological health of IMA and non-IMA aquaculture ponds in the region surrounding the Indian Sundarbans mangroves, located along the northeast coast of the Bay of Bengal. Surface water samples were collected in October 2022 from six aquaculture ponds, four IMA (IMA_C1, IMA_C3, IMA_DB1, and IMA_DB4) and two non-IMA (C6_NM and DB5_NM), to characterize niche-specific biological communities using the environmental DNA (eDNA) approach. During sampling, in-situ environmental parameters were recorded. Mangrove litter-derived phenolics (tannic and gallic acids) and dissolved nutrients were estimated using a UV–Vis spectrophotometer, while dissolved organic carbon (DOC) was measured with an elemental analyzer. Metal and metalloid concentrations were determined by inductively coupled plasma mass spectrometry (ICP–MS). IMA ponds showed ideal conditions for shrimp aquaculture, with pH ranging from 7.913 to 8.633 and dissolved oxygen (DO) between 5.32 and 6.03 mg/L, indicating no hypoxic conditions despite higher concentrations of phenolics. High-throughput sequencing (HTS) based on Oxford Nanopore Technologies (ONT) sequencing chemistry was undertaken on the MinION platform. In the eDNA extracted from each studied pond, it revealed the predominance of Proteobacteria among prokaryotes and of Bacillariophyta and Chlorophyta among eukaryotes. Additionally, members of the family Cyprinidae were detected, reflecting the biodiversity of fish populations in these ponds. Functional gene profiling indicated signatures associated with nitrogen, phosphorus, sulphur, potassium and iron acquisition and metabolism, along with pathways related to aromatic compound degradation. Overall, dissolved nutrients, DOC, metal and metalloid ion concentrations, and the structural and functional profiles of biological communities provide a comprehensive basis for evaluating the ecological health of aquaculture ponds. This study generates important baseline information for long-term monitoring and represents the first eDNA-based high-throughput sequencing assessment of surface water from IMA and non-IMA aquaculture ponds in close proximity to the Sundarbans mangrove.
Data in Brief, Volume 65, Article 112542
Citations: 0
A dataset of precipitate-containing multi-principal element alloys
IF 1.4 Q3 MULTIDISCIPLINARY SCIENCES Pub Date : 2026-01-30 DOI: 10.1016/j.dib.2026.112540
Anshu Raj, Xin Wang, Matthew Luebbe, Haiming Wen, Kun Lu, Shuozhi Xu
We report a curated dataset that brings together composition, processing conditions, microstructural details, and mechanical properties for 396 combinations of alloy composition and processing condition drawn from 100 peer-reviewed research articles on precipitate-containing multi-principal element alloys (MPEAs). The dataset was created by first utilizing a generative large language model for information extraction, followed by expert review to ensure accurate recovery of materials data. Compositional information was taken directly from tables and text, while processing routes — including homogenization, rolling, recrystallization, and aging — were converted into uniform temperature and time metrics. Microstructural descriptors, including precipitate phases and sizes, were consolidated into a consistent labeling scheme to accommodate the wide range of terminology used in published literature. Finally, mechanical property data, such as strength and ductility, were compiled together with the temperatures at which they were measured. These data provide a coherent view of the composition-processing-microstructure-property features explored in existing MPEA research and establish a resource that supports data-driven alloy design as well as future development of automated materials information-extraction methodologies. The complete dataset is available on Zenodo.
Data in Brief, Volume 65, Article 112540
Citations: 0
Comprehensive image dataset of flexible pavement: Alligator cracks and edge-breaks from national highway (N6) of urban areas
IF 1.4 Q3 MULTIDISCIPLINARY SCIENCES Pub Date : 2026-01-30 DOI: 10.1016/j.dib.2026.112531
Md. Nayem Hossain, Nakib Aman, Nymur Rahaman Antor, Nafisa Tasnim, Md Zubair Azam, Sabab Asfaq
This data article introduces a structured pavement surface image dataset developed to advance research in automated pavement condition assessment and data-driven road infrastructure monitoring. The dataset comprises three distinct pavement condition categories: (i) alligator cracking, (ii) edge-breaking distress, and (iii) undamaged (intact) pavement surfaces, each representing a prevalent form of pavement deterioration or intact condition typically observed in flexible pavements. The dataset consists of 12,000 raw images (4000 per class) collected under real-world conditions. These images represent the primary scientific contribution of the dataset. All images were standardized through resizing and normalization, and the dataset was partitioned into training, validation, and testing subsets to ensure reproducibility and consistency in data-driven experiments. Pavement images were collected from selected segments of National Highway N6 in Pabna District, Bangladesh, under natural daylight conditions using a smartphone camera during field surveys. Image acquisition was conducted following standard safety practices without disrupting traffic flow. All images were manually reviewed and labelled to ensure annotation accuracy. This dataset is intended to support research on automated pavement crack detection and classification, benchmarking of computer vision and deep learning models, and the development of lightweight and edge-deployable inspection systems.
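The abstract says the 12,000 images (4000 per class) were partitioned into training, validation, and testing subsets, but it does not give the proportions. As a rough illustration only, the sketch below shows one way a class-balanced split could be produced. The 70/15/15 ratios, the class label strings, and the function name `stratified_split` are assumptions made for this example and are not taken from the dataset.

```python
import random


def stratified_split(labels, percents=(70, 15, 15), seed=0):
    """Return (train, val, test) index lists, splitting every class by the same percentages."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Integer arithmetic avoids floating-point rounding surprises.
        n_train = len(indices) * percents[0] // 100
        n_val = len(indices) * percents[1] // 100
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test subset
    return train, val, test


# Class counts follow the abstract; the label strings and ratios are hypothetical.
labels = ["alligator_crack"] * 4000 + ["edge_break"] * 4000 + ["intact"] * 4000
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting within each class keeps the three categories equally represented in every subset, which matters when the dataset is used to benchmark classifiers. Anyone using the published partition should rely on the split shipped with the dataset instead of regenerating one.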
Data in Brief, Volume 65, Article 112531
Citations: 0
Dataset on resource allocation and usage for a private cloud
IF 1.4 Q3 MULTIDISCIPLINARY SCIENCES Pub Date : 2026-01-29 DOI: 10.1016/j.dib.2026.112514
Paola Marques, Mariana Mendes, Thiago Emmanuel Pereira, Giovanni Farias
While public cloud providers dominate the commercial landscape, private clouds are widely adopted by academic and research institutions to meet specific governance and operational requirements. Multiple datasets on resource usage in public clouds are available; however, datasets capturing usage patterns in private clouds remain scarce, which limits research in this area. This work presents a dataset comprising over 64 million records collected from a private OpenStack-based cloud operated by the Distributed Systems Laboratory at the Federal University of Campina Grande, Brazil. Data were gathered continuously over nearly twelve months (May 23, 2024 to May 16, 2025) by querying OpenStack APIs and monitoring services every five minutes. The dataset captures different aspects of the infrastructure, including allocation quotas, user-to-project associations (as OpenStack groups users into projects), server (virtual machine) specifications, and resource utilization for users and projects. Entries are timestamped, enabling temporal analyses of system dynamics. Sensitive attributes, such as user names, project names, IP addresses, and server names, were protected, leaving only system-generated UUIDs. By offering a detailed, time-stamped view of a private cloud, this dataset provides a valuable resource for cloud computing research, helping to bridge the gap in publicly available datasets from non-commercial cloud environments. The dataset is valuable not only for academic institutions but also for companies considering cloud repatriation.
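To give a sense of scale, the collection window and the five-minute polling interval stated in the abstract can be turned into an approximate number of polling rounds. The sketch below is back-of-the-envelope arithmetic only: it assumes collection ran without gaps and counts both end dates, neither of which the abstract confirms, and the function name `expected_polls` is introduced here purely for illustration.

```python
from datetime import date, timedelta


def expected_polls(start: date, end: date, interval: timedelta) -> int:
    """Polling rounds in [start, end], assuming uninterrupted collection (an assumption)."""
    days = (end - start).days + 1  # count both the first and the last day
    polls_per_day = timedelta(days=1) // interval
    return days * polls_per_day


# Dates and interval are taken from the abstract.
polls = expected_polls(date(2024, 5, 23), date(2025, 5, 16), timedelta(minutes=5))
print(polls)

# Rough records per polling round, using the "over 64 million" figure as a lower bound.
print(64_000_000 // polls)
```

Under these assumptions the window holds on the order of 100,000 polling rounds, so each round would contribute several hundred records on average. Any outages in collection would lower the number of rounds and raise the per-round average.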
Data in Brief, Volume 65, Article 112514
Citations: 0