Pub Date: 2026-02-03; DOI: 10.1016/j.dib.2026.112524
Diana Sofia Hanafiah, Rahmatika Alfi, Anggria Lestami, Fanindia Purnamasari, Rossy Nurhasanah, Muhammad Ariyo Syahraza, Muhammad Azis Saputra, Usman Ismail Pane, Steven Manurung, Keisya, Yunus Tio Buntoro, Josua Peter Corda, Gali Rakasiwi
Soybean (Glycine max L.) holds an important position as a main source of protein in Indonesia. Its quality and productivity can be assessed from the characteristics of its seed. Accordingly, identification through the observation of soybean seed traits is a crucial step in plant breeding and quality assurance. Conventional approaches rely on manual observation, which is subjective, prone to human error, and time-consuming. With advances in artificial intelligence, automated seed identification has emerged as a potential solution. However, progress is constrained by the lack of open and standardized image datasets, especially for locally bred varieties in developing countries. To address this gap, we present an open image dataset of Indonesian soybean seeds from three widely cultivated varieties developed through plant breeding: Anjasmoro, Grobogan, and DEGA-1. The dataset consists of high-resolution seed images captured with an Epson L360 flatbed scanner, with the optical resolution fixed at 800 dots per inch, yielding images of 6800 × 9359 pixels. All raw images are saved in JPG format. No manually annotated segmentation masks are released in this version; instead, DeepLab V3+ with a MobileNet backbone is used to enable automated seed image segmentation. The curated dataset is intended to support a broad range of applications, including computer vision tasks such as image classification and segmentation, as well as research in plant breeding, seed quality assessment, and agricultural informatics. By providing a standardized and publicly accessible resource, this dataset contributes to the advancement of interdisciplinary studies at the intersection of agriculture and artificial intelligence.
Title: An open image dataset of Indonesian soybean seed varieties (Anjasmoro, Grobogan, DEGA-1) for agricultural research and machine learning applications. Journal: Data in Brief, vol. 65, Article 112524.
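The scan geometry quoted in the soybean abstract can be sanity-checked with a little arithmetic. The sketch below assumes square pixels and that the stated 800 dpi applies on both axes; it converts the 6800 × 9359 pixel frame to a physical size and derives the pixel pitch that a seed-measurement script would need. The function and variable names are illustrative and not part of the dataset.

```python
MM_PER_INCH = 25.4


def pixels_to_mm(pixels: float, dpi: float) -> float:
    """Convert a length in pixels to millimetres at a given scan resolution."""
    return pixels / dpi * MM_PER_INCH


DPI = 800                         # optical resolution stated in the abstract
WIDTH_PX, HEIGHT_PX = 6800, 9359  # image size stated in the abstract

width_mm = pixels_to_mm(WIDTH_PX, DPI)
height_mm = pixels_to_mm(HEIGHT_PX, DPI)
pixel_pitch_mm = pixels_to_mm(1, DPI)  # physical size of one pixel

print(f"scan area: {width_mm:.1f} mm x {height_mm:.1f} mm")
print(f"pixel pitch: {pixel_pitch_mm:.5f} mm")
```

Under these assumptions the frame covers roughly 216 mm by 297 mm, and each pixel spans about 0.032 mm, so a seed measured at a given number of pixels can be converted to millimetres with the same helper.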
Pub Date: 2026-02-02; DOI: 10.1016/j.dib.2026.112541
Roger Chiu-Coutino, Miguel S. Soriano-Garcia, Carlos Israel Medel-Ruiz, S.M. Afanador-Delgado, Edgar Villafaña-Rauda, Roger Chiu
This data article presents an experimental dataset of scattered images, obtained using a low-cost, open-source, Raspberry Pi-based optical system. Each data sample includes two grayscale images of 256 × 256 pixels: (i) the scattered image and (ii) the original projected pattern, which serves as ground truth. The system projects diverse patterns through various optical diffusers with different scattering coefficients and physical thicknesses. The dataset includes geometric shapes, digits, and textures to increase variability and support generalization. This variety allows the analysis of distinct scattering regimes and the evaluation of image recovery models under varying optical complexities. The dataset supports deep learning research focused on inverse problems in optics. It is particularly useful for training and benchmarking image restoration models in scattering environments.
Title: Dataset of scattered images using noncoherent light under varying diffusion conditions and projected patterns. Journal: Data in Brief, vol. 65, Article 112541.
Pub Date: 2026-02-02; DOI: 10.1016/j.dib.2026.112536
Maurizio Santoro, Oliver Cartus, Arnan Araza, Martin Herold, Jukka Miettinen, Ake Rosenqvist, Kazufumi Kobayashi, Takeo Tadono, Frank Martin Seifert
Spatially explicit information on forest structure and biomass is needed to meet the monitoring and reporting requirements of several European policies. Satellite images enable mapping and monitoring of Europe’s forest resources through operational observations from the Sentinel-1 Synthetic Aperture Radar (SAR) and the Advanced Land Observing Satellite 2 (ALOS-2) Phased Array L-band SAR 2 (PALSAR-2) instruments. Data acquired in 2017, 2020, 2021 and 2023 were used to generate annual maps of forest biomass variables, namely Growing Stock Volume (GSV), Aboveground Biomass (AGB) and Belowground Biomass (BGB), with a pixel size of 20 m × 20 m. All products are in the geometry of the Sentinel-2 tiling system. A spatially averaged map with a pixel size of 100 m × 100 m (1 hectare) in geographic projection is also supplied for users who do not require the highest spatial resolution. The maps were generated with a fully documented processing chain that includes (i) pre-processing of the SAR data to create stacks of co-registered, terrain-geocoded images of the backscattered intensity and (ii) inversion of a physically based model to estimate GSV. AGB and BGB were subsequently estimated using allometric relationships. Per-pixel standard deviations were computed for each biomass variable by propagating uncertainties from both the SAR observations and the model parameters. The maps clearly reproduce the expected spatial patterns of forest biomass across Europe and provide sufficient spatial detail to identify biomass dynamics related to, e.g., logging and regrowth. Validation against measurements collected by National Forest Inventories (NFIs) indicates poor agreement with map values at the pixel scale, with errors larger than 50% of the reference biomass. The correspondence improved substantially for spatial aggregates, such as administrative units, for which the bias was mostly negligible and the mean square error was below 30% of the reference value. 
The number of ALOS-2 PALSAR-2 images affected the inter-annual consistency of the maps, which was lower in regions with only one or two observations per year.
Title: Europe-wide maps of biomass density based on satellite remote sensing data for 2017, 2020, 2021 and 2023. Journal: Data in Brief, vol. 65, Article 112536.
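The validation figures in the biomass abstract are expressed relative to the reference biomass. A minimal sketch of how such relative statistics are commonly computed follows. It assumes the percentages are normalized by the mean reference value and uses the root of the mean square error, which may differ from the authors' exact definition. The sample values are made up for illustration and are not taken from the dataset.

```python
import math


def relative_bias_and_rmse(map_values, reference_values):
    """Return (bias, RMSE), each as a percentage of the mean reference value."""
    if not map_values or len(map_values) != len(reference_values):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(map_values)
    diffs = [m - r for m, r in zip(map_values, reference_values)]
    ref_mean = sum(reference_values) / n
    bias = sum(diffs) / n                             # mean signed error
    rmse = math.sqrt(sum(d * d for d in diffs) / n)   # root mean square error
    return 100.0 * bias / ref_mean, 100.0 * rmse / ref_mean


# Hypothetical biomass values for four aggregation units (illustrative only).
map_agb = [110.0, 90.0, 105.0, 95.0]
nfi_agb = [100.0, 100.0, 100.0, 100.0]

rel_bias, rel_rmse = relative_bias_and_rmse(map_agb, nfi_agb)
print(f"relative bias: {rel_bias:.2f}%  relative RMSE: {rel_rmse:.2f}%")
```

Applying the same function once to pixel-level pairs and once to unit-level averages is one way to see why aggregation tends to reduce the relative error: random per-pixel deviations partly cancel when averaged.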
Pub Date: 2026-02-02; DOI: 10.1016/j.dib.2026.112530
Mohammed Borhan Uddin, Mohammad Shamsul Arefin, M.M. Musharaf Hussain
We present BigFlow-NIDS, a large-scale, NetFlow-based dataset for network intrusion detection research in big-data environments. BigFlow-NIDS contains 66,935,021 flows, 55 flow attributes, and 32 fine-grained attack categories, and is available in both CSV and Parquet formats to support scalable ML and streaming analyses. Compared with CSV, loading from Parquet reduced read time dramatically (CSV: 920.82 s vs Parquet: 27.35 s) under the paper’s Colab setup, demonstrating the importance of columnar storage for large NIDS corpora. The dataset contains 36.6 million benign flows and 30.3 million attack flows, indicating a noticeable class imbalance. We release BigFlow-NIDS and provide baseline exploratory analyses and anomaly-detection experiments to support the development and evaluation of scalable, temporally aware intrusion detection systems.
Title: BigFlow-NIDS: A large-scale dataset for network intrusion detection in big data environment. Journal: Data in Brief, vol. 65, Article 112530.
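The BigFlow-NIDS figures lend themselves to a quick back-of-the-envelope check. The sketch below derives the Parquet-over-CSV speedup and the benign share from the numbers quoted in the abstract, and adds a small loader showing how a columnar file allows reading only selected attributes. The loader uses `pandas.read_parquet`; the file path and column names a caller would pass are hypothetical, since the abstract does not list them.

```python
# Figures quoted in the abstract (measured under the authors' Colab setup).
CSV_READ_S = 920.82
PARQUET_READ_S = 27.35
BENIGN_FLOWS_M = 36.6   # millions of benign flows
ATTACK_FLOWS_M = 30.3   # millions of attack flows

speedup = CSV_READ_S / PARQUET_READ_S
benign_share = BENIGN_FLOWS_M / (BENIGN_FLOWS_M + ATTACK_FLOWS_M)


def load_flows(parquet_path, columns=None):
    """Read a Parquet file, optionally restricted to a subset of columns.

    The path and column names are placeholders chosen by the caller;
    they are assumptions, not names taken from the dataset.
    """
    import pandas as pd  # imported lazily so the arithmetic above runs without pandas
    return pd.read_parquet(parquet_path, columns=columns)


print(f"Parquet speedup over CSV: {speedup:.1f}x")
print(f"benign share of flows: {benign_share:.1%}")
```

With the quoted timings, Parquet loads roughly 34 times faster than CSV, and benign flows make up a little over half of the corpus. Passing a `columns` list to the loader is where columnar storage helps most, because unneeded attributes are never read from disk.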
Pub Date: 2026-02-02; DOI: 10.1016/j.dib.2026.112537
Soufiyan Ouali, Said El Garouani
The healthcare domain constitutes a fundamental pillar of national development, as maintaining population health not only enhances citizens' quality of life but also generates substantial economic benefits through increased productivity, innovation, and workforce participation. However, the healthcare industry faces numerous challenges and barriers that impede universal access to medical services. In low- and middle-income countries, significant portions of the population forgo medical consultations due to various socioeconomic constraints, including prohibitive consultation fees, scheduling difficulties, and extended waiting periods. Consequently, there is an urgent need for innovative approaches to optimize healthcare delivery processes. Recent advances in artificial intelligence have demonstrated promising potential in developing intelligent systems that address healthcare accessibility gaps. These innovations include medical chatbots, appointment booking systems, disease-prediction models, and psychiatric virtual assistants. However, such technological enhancements have predominantly focused on high-resource languages, while research in low-resource languages, particularly Arabic, remains in its preliminary stages. This disparity is especially pronounced in Arabic dialects, which differ substantially from Modern Standard Arabic in terms of vocabulary, syntax, and semantic structures. To address this critical gap, we present the first comprehensive dataset for the Moroccan Arabic dialect in the healthcare domain. The MedQA-MA dataset comprises 108,943 question-answer pairs in text format, with each pair categorized according to medical specialty. Covering 23 distinct medical specialties, the dataset serves multiple applications, including sentiment analysis, specialty classification, question-answering systems, and the development of human-like medical chatbots. 
The dataset has been meticulously curated, annotated, and validated by qualified medical professionals, ensuring its reliability and clinical relevance for developing realistic healthcare systems grounded in authentic medical interactions.
The MedQA-MA dataset is publicly available and freely accessible at https://data.mendeley.com/datasets/v6gs7nsy9z/1, representing a significant contribution to Arabic Natural Language Processing research in healthcare applications and facilitating the development of culturally and linguistically appropriate medical AI systems for Arabic-speaking populations.
Title: MedQA-MA: A Moroccan Arabic medical question-answering dataset for virtual healthcare assistants and large language models. Journal: Data in Brief, vol. 65, Article 112537.
Pub Date: 2026-02-02; DOI: 10.1016/j.dib.2026.112535
Alejandra Figueroa-Vargas, Gabriela Valdebenito-Oyarzo, María Paz Martínez-Molina, Francisco Zamorano, Pablo Billeke
In daily life, we often face decisions where potential outcomes are unclear, creating uncertainty. The complete or partial lack of knowledge regarding outcome probabilities—referred to as ambiguity—poses significant challenges for individuals. While recent studies have linked ambiguity in decision-making to neural activity in the parietal cortex, the precise role of this region and its interactions with other brain areas remain poorly understood.
Here, we present a comprehensive dataset on human decision-making under conditions of risk and ambiguity. The dataset includes two experimental sessions. The first one corresponds to the MRI setting, which includes structural MRI (T1- and T2-weighted images, n = 52), diffusion-weighted imaging (n = 45), and task-based functional MRI (n = 38). The second session corresponds to the EEG setting combined with inhibitory transcranial magnetic stimulation (TMS), targeting two parietal regions and the vertex (n = 24). TMS targets were defined from group-level fMRI activations obtained in the first session and then transformed to individual anatomy. Ten participants completed both fMRI and EEG-TMS recordings.
This dataset, partially analyzed in previous work, now includes newly acquired and previously unexamined data, such as diffusion-weighted imaging and T2-weighted images, and is fully organized according to the Brain Imaging Data Structure (BIDS) standard. It provides valuable opportunities to investigate the neurobiological mechanisms of decision-making under ambiguity, with a focus on the parietal cortex.
Title: A comprehensive multimodal MRI and EEG-TMS dataset on the impact of parietal cortex inhibition on decision-making under ambiguity. Journal: Data in Brief, vol. 65, Article 112535.
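Because the MRI and EEG-TMS dataset is organized according to BIDS, file locations follow a predictable naming pattern. The sketch below builds relative paths using the general BIDS convention (`sub-<label>`, optional `ses-<label>`, optional `task-<label>`, then a suffix). The subject label, task name, and session usage shown here are hypothetical; the actual labels used in this dataset are not given in the abstract.

```python
from pathlib import PurePosixPath


def bids_path(subject, datatype, suffix, task=None, session=None, ext=".nii.gz"):
    """Build a relative BIDS-style path from its naming entities."""
    entities = [f"sub-{subject}"]
    if session:
        entities.append(f"ses-{session}")
    if task:
        entities.append(f"task-{task}")
    filename = "_".join(entities + [suffix]) + ext

    folder = PurePosixPath(f"sub-{subject}")
    if session:
        folder = folder / f"ses-{session}"
    return folder / datatype / filename


# Hypothetical labels, for illustration only.
t1w = bids_path("01", "anat", "T1w")
dwi = bids_path("01", "dwi", "dwi")
bold = bids_path("01", "func", "bold", task="ambiguity")

for path in (t1w, dwi, bold):
    print(path)
```

In practice a BIDS-aware reader would be a more robust choice than hand-built paths; the helper only illustrates why a standardized layout makes the structural, diffusion, and functional files easy to locate per participant.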
Integrated Mangrove Aquaculture (IMA) and Sustainable Aquaculture in Mangrove Ecosystem Fisheries (SAIME) are key activities undertaken across coastal regions globally to meet growing demand for brackish-water aquaculture products through sustainable practices. An in-depth biomonitoring study was conducted to map the ecological health of IMA and non-IMA aquaculture ponds in the region surrounding the Indian Sundarbans mangroves, along the northeast coast of the Bay of Bengal. Surface water samples were collected in October 2022 from six aquaculture ponds, four IMA (IMA_C1, IMA_C3, IMA_DB1, and IMA_DB4) and two non-IMA (C6_NM and DB5_NM), to characterize niche-specific biological communities using the environmental DNA (eDNA) approach. During sampling, in-situ environmental parameters were recorded. Mangrove litter-derived phenolics (tannic and gallic acids) and dissolved nutrients were estimated using a UV–Vis spectrophotometer, while dissolved organic carbon (DOC) was measured with an elemental analyzer. Metal and metalloid concentrations were determined by inductively coupled plasma mass spectrometry (ICP–MS). IMA ponds showed ideal conditions for shrimp aquaculture, with pH ranging from 7.913 to 8.633 and dissolved oxygen (DO) between 5.32 and 6.03 mg/L, indicating no hypoxic conditions despite higher concentrations of phenolics. High-throughput sequencing (HTS) based on Oxford Nanopore Technologies (ONT) sequencing chemistry was undertaken on the MinION platform, revealing the predominance of Proteobacteria among prokaryotes and of Bacillariophyta and Chlorophyta among eukaryotes in eDNA extracted from each studied pond. Members of the family Cyprinidae were also detected, reflecting the diversity of the fish population in these ponds. 
Functional gene profiling indicated signatures associated with nitrogen, phosphorus, sulphur, potassium and iron acquisition and metabolism, along with pathways related to aromatic compound degradation. Overall, dissolved nutrients, dissolved organic carbon (DOC), metal and metalloid ion concentrations as well as structure and functional profiles of biological communities provide a comprehensive basis for evaluating the ecological health of aquaculture ponds. This study generates important baseline information for long-term monitoring and represents the first eDNA-based high-throughput sequencing assessment of IMA and non-IMA aquaculture ponds from surface water in close proximity to the Sundarbans mangrove.
Title: Dataset on ecological health and microbial communities of coastal aquaculture ponds from surrounding region of Sundarban mangroves. Authors: Yash, Anwesha Ghosh, Ajanta Dey, Milon Sinha, Nimai Bera, Sabyasachi Chakraborty, Punyasloke Bhadury. Journal: Data in Brief, vol. 65, Article 112542. Pub Date: 2026-01-31; DOI: 10.1016/j.dib.2026.112542.
Pub Date : 2026-01-30 DOI: 10.1016/j.dib.2026.112540
Anshu Raj, Xin Wang, Matthew Luebbe, Haiming Wen, Kun Lu, Shuozhi Xu
We report a curated dataset that brings together composition, processing conditions, microstructural details, and mechanical properties for 396 combinations of alloy composition and processing condition drawn from 100 peer-reviewed research articles on precipitate-containing multi-principal element alloys (MPEAs). The dataset was created by first utilizing a generative large language model for information extraction, followed by expert review to ensure accurate recovery of materials data. Compositional information was taken directly from tables and text, while processing routes — including homogenization, rolling, recrystallization, and aging — were converted into uniform temperature and time metrics. Microstructural descriptors, including precipitate phases and sizes, were consolidated into a consistent labeling scheme to accommodate the wide range of terminology used in published literature. Finally, mechanical property data, such as strength and ductility, were compiled together with the temperatures at which they were measured. These data provide a coherent view of the composition-processing-microstructure-property features explored in existing MPEA research and establish a resource that supports data-driven alloy design as well as future development of automated materials information-extraction methodologies. The complete dataset is available on Zenodo.
Title: A dataset of precipitate-containing multi-principal element alloys. Data in Brief, Volume 65, Article 112540.
Pub Date : 2026-01-30 DOI: 10.1016/j.dib.2026.112531
Md. Nayem Hossain, Nakib Aman, Nymur Rahaman Antor, Nafisa Tasnim, Md Zubair Azam, Sabab Asfaq
This data article introduces a structured pavement surface image dataset developed to advance research in automated pavement condition assessment and data-driven road infrastructure monitoring. The dataset comprises three pavement condition categories: (i) alligator cracking, (ii) edge-breaking distress, and (iii) undamaged (intact) pavement surfaces, each representing either a prevalent form of deterioration or the intact condition typically observed in flexible pavements. The dataset consists of 12,000 raw images (4,000 per class) collected under real-world conditions. These images represent the primary scientific contribution of the dataset. All images were standardized through resizing and normalization, and the dataset was partitioned into training, validation, and testing subsets to ensure reproducibility and consistency in data-driven experiments. Pavement images were collected with a smartphone camera during field surveys of selected segments of National Highway N6 in Pabna District, Bangladesh, under natural daylight conditions. Image acquisition followed standard safety practices and did not disrupt traffic flow. All images were manually reviewed and labelled to ensure annotation accuracy. This dataset is intended to support research on automated pavement crack detection and classification, benchmarking of computer vision and deep learning models, and the development of lightweight and edge-deployable inspection systems.
Title: Comprehensive image dataset of flexible pavement: Alligator cracks and edge-breaks from national highway (N6) of urban areas. Data in Brief, Volume 65, Article 112531.
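The abstract above says the 12,000 images (three classes of 4,000 each) were partitioned into training, validation, and testing subsets, but it does not give the split ratios. The following is a minimal sketch of one way such a class-balanced partition could be produced. The 70/15/15 percentages, the class labels, and the file names are all assumptions made for illustration; they are not taken from the dataset.

```python
import random

# Hypothetical class labels and file names; the real dataset layout may differ.
CLASSES = ("alligator_crack", "edge_break", "intact")
IMAGES_PER_CLASS = 4000


def stratified_split(items_by_class, percents=(70, 15, 15), seed=0):
    """Split each class separately so every subset keeps the class balance.

    `percents` is (train, val, test) in whole percent; the 70/15/15 default
    is an assumption, not a value reported by the dataset authors.
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    splits = {"train": [], "val": [], "test": []}
    for label, items in items_by_class.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        n_train = len(shuffled) * percents[0] // 100
        n_val = len(shuffled) * percents[1] // 100
        splits["train"] += [(name, label) for name in shuffled[:n_train]]
        splits["val"] += [(name, label) for name in shuffled[n_train:n_train + n_val]]
        # Whatever is left after train and val goes to the test subset.
        splits["test"] += [(name, label) for name in shuffled[n_train + n_val:]]
    return splits


items = {c: [f"{c}_{i:04d}.jpg" for i in range(IMAGES_PER_CLASS)] for c in CLASSES}
splits = stratified_split(items)
for subset, rows in splits.items():
    print(subset, len(rows))
```

Splitting per class rather than over the pooled list keeps each subset at exactly one third per category, which matters when accuracy is compared across models trained on the same partition.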
Pub Date : 2026-01-29 DOI: 10.1016/j.dib.2026.112514
Paola Marques, Mariana Mendes, Thiago Emmanuel Pereira, Giovanni Farias
While public cloud providers dominate the commercial landscape, private clouds are widely adopted by academic and research institutions to meet specific governance and operational requirements. Multiple datasets on resource usage in public clouds are available; however, datasets capturing usage patterns in private clouds remain scarce, which limits research in this area. This work presents a dataset comprising over 64 million records collected from a private OpenStack-based cloud operated by the Distributed Systems Laboratory at the Federal University of Campina Grande, Brazil. Data were gathered continuously over nearly twelve months (May 23, 2024 to May 16, 2025) by querying OpenStack APIs and monitoring services every five minutes. The dataset captures several aspects of the cloud: infrastructure, allocation quotas, user-to-project associations (OpenStack groups users into projects), server (virtual machine) specifications, and resource utilization for users and projects. Entries are timestamped, enabling temporal analyses of system dynamics. Sensitive attributes, such as user names, project names, IP addresses, and server names, were protected, leaving only system-generated UUIDs. By offering a detailed, timestamped view of a private cloud, this dataset provides a valuable resource for cloud computing research, helping to bridge the gap in publicly available datasets from non-commercial cloud environments. The dataset is valuable not only for academic institutions but also for companies considering cloud repatriation.
Title: Dataset on resource allocation and usage for a private cloud. Data in Brief, Volume 65, Article 112514.
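The collection window and five-minute polling interval quoted in the abstract above give a rough sense of the dataset's temporal resolution. The sketch below estimates how many polling rounds that window could contain. It assumes the end date is inclusive and that collection ran without gaps, neither of which the abstract states; the function and constant names are illustrative only.

```python
from datetime import date, timedelta

# Dates and interval as quoted in the abstract.
START = date(2024, 5, 23)
END = date(2025, 5, 16)
POLL_INTERVAL = timedelta(minutes=5)
REPORTED_RECORDS = 64_000_000  # "over 64 million", so treat as a lower bound


def expected_polls(start, end, interval):
    """Upper estimate of polling rounds, counting the end date as a full day.

    Assumes uninterrupted collection; real outages would lower the count.
    """
    span = (end - start) + timedelta(days=1)
    return span // interval  # timedelta // timedelta yields an int


polls = expected_polls(START, END, POLL_INTERVAL)
records_per_poll = REPORTED_RECORDS // polls
print(f"polling rounds: {polls}")
print(f"records per round (rough lower bound): {records_per_poll}")
```

Because the abstract reports only a floor on the record count and says nothing about outages, the per-round figure is an order-of-magnitude guide rather than a property of the dataset.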