Pub Date: 2026-04-01; Epub Date: 2026-01-27; DOI: 10.1016/j.dib.2026.112518
Tao He, Wei Lu
We present an OpenStreetMap-derived multimodal dataset spanning 23 cities and 11,711 tile-level samples. For each 768 × 768 m tile, we provide an aligned image pair: (i) a stylized ecological baseline that generalizes green and water features together with major roads and railways, and (ii) a target urban morphology map color-coded by functional building classes, transport infrastructure, green space, and water. Each sample includes latitude/longitude; the eight WorldClim v2.1 bioclimatic variables can be reconstructed locally with the provided script. The dataset is organized by city and indexed with JSONL records linking image paths and attributes, enabling direct integration into machine learning pipelines. Cross-city and cross-climate coverage supports training and evaluation of generative models for urban design, comparative analyses of morphology across climate regimes, and imputation of functional footprints in data-scarce regions. The ecological baseline represents a constructed pre-urban template rather than a historical map.
Title: OpenStreetMap-derived multimodal dataset across 23 cities: Paired urban morphology tiles with bioclimatic variables; Journal: Data in Brief, Volume 65, Article 112518
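The JSONL index described above can be read one record per line. The following is a minimal sketch, assuming hypothetical field names (`city`, `baseline`, `target`, `lat`, `lon`) and placeholder paths; the actual keys are defined by the dataset's own records and may differ.

```python
import json
from collections import defaultdict


def load_index(lines):
    """Parse JSONL text lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def group_by_city(records):
    """Group records by their (hypothetical) 'city' field."""
    by_city = defaultdict(list)
    for rec in records:
        by_city[rec["city"]].append(rec)
    return dict(by_city)


# Placeholder records; field names and paths are assumptions for illustration.
SAMPLE_JSONL = """\
{"city": "city_a", "baseline": "city_a/0001_base.png", "target": "city_a/0001_target.png", "lat": 1.0, "lon": 2.0}
{"city": "city_a", "baseline": "city_a/0002_base.png", "target": "city_a/0002_target.png", "lat": 1.5, "lon": 2.5}

{"city": "city_b", "baseline": "city_b/0001_base.png", "target": "city_b/0001_target.png", "lat": 3.0, "lon": 4.0}
"""

records = load_index(SAMPLE_JSONL.splitlines())
by_city = group_by_city(records)
print({city: len(items) for city, items in by_city.items()})
```

Grouping by city first makes it straightforward to build cross-city train/test splits, which is the kind of evaluation the abstract mentions.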
Pub Date: 2026-04-01; Epub Date: 2026-01-19; DOI: 10.1016/j.dib.2026.112488
J. Samuel Baixauli-Soler, María Belda-Ruiz, Gabriel Lozano-Reina, Juan David Peláez-León, Gregorio Sánchez-Marín
This article presents a dataset on 508 medium-sized Spanish family firms, collected between March and June 2016 through structured telephone interviews with CEOs or HR directors. The questionnaire covered four dimensions: family involvement and socioemotional wealth (SEW), human resource management (HRM) practices, financial strategies, and managerial demographics. To complement survey data, financial indicators were extracted from the SABI (Sistema de Análisis de Balances Ibéricos) database. The dataset integrates subjective managerial assessments with objective firm-level information, offering a unique resource for research on family business management, HRM, and financial policies. Variables include firm ownership and management, generational structures, SEW priorities, human capital and HRM practices, financial goals and capital access, as well as managers’ demographic characteristics. The database is released in cleaned, anonymized, and fully documented form (together with the questionnaire and a detailed codebook), enabling replication, comparative studies, and meta-analyses on family firms and related organizational topics.
Title: Dataset on Spanish medium-sized family firms: Linking socioemotional wealth, HRM practices, and financial indicators; Journal: Data in Brief, Volume 65, Article 112488
Pub Date: 2026-04-01; Epub Date: 2026-01-28; DOI: 10.1016/j.dib.2026.112520
Andrea Senese, Saverio De Vito, Elena Esposito, Michele Villari, Giovanni Acampora, Girolamo Di Francia, Antonia Longobardi, Giulia Monteleone
Hydrogen transport involves the safe movement of gaseous hydrogen through industrial pipeline networks, typically between production plants, storage facilities, and distribution centers, and is a key component in the transition toward more sustainable energy sources [1]. Monitoring these networks is essential: hydrogen is highly flammable, and leaks, compressor failures, or delayed component responses can lead to serious accidents, environmental damage, and operational interruptions. Despite growing interest in this sector, publicly available datasets containing multivariate data on hydrogen transport networks are extremely limited, hindering the development and evaluation of data-driven monitoring methods [2-4]. To address this gap, we present a synthetic dataset simulated using a MATLAB Simscape model of a pipeline segment representative of an industrial network [5-7,14]. The dataset includes time-series data from distributed virtual sensors, covering both normal operating conditions and anomalous scenarios such as leaks, compressor failures, and delayed component responses [8,9]. The simulation reproduces transient and steady-state dynamics typical of industrial networks, providing data suitable for the development and evaluation of algorithms for digital twins [10], monitoring, and anomaly detection in hydrogen transport infrastructures [10,11].
Title: A simulation-based dataset for anomaly detection in hydrogen blend transport networks; Journal: Data in Brief, Volume 65, Article 112520
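As an illustration of the kind of baseline such sensor time series could be used to evaluate, the sketch below flags samples that deviate strongly from a reference window of normal operation. It is not the authors' method; the pressure values, the window length, and the 3-sigma threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def zscore_flags(series, baseline_len, threshold=3.0):
    """Flag samples whose distance from the baseline mean exceeds
    `threshold` baseline standard deviations."""
    base = series[:baseline_len]
    mu, sigma = mean(base), stdev(base)
    if sigma == 0:
        raise ValueError("baseline window has zero variance")
    return [abs(x - mu) / sigma > threshold for x in series]


# Made-up pressure readings (arbitrary units): eight normal samples,
# then a sudden drop such as a leak might produce.
pressure = [50.0, 50.1, 49.9, 50.05, 49.95, 50.0, 50.1, 49.9, 47.0, 46.5]
flags = zscore_flags(pressure, baseline_len=8)
print([i for i, flagged in enumerate(flags) if flagged])
```

A fixed baseline window is the simplest choice; a rolling window or a multivariate detector would be natural next steps when several virtual sensors are available.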
Pub Date: 2026-04-01; Epub Date: 2026-02-02; DOI: 10.1016/j.dib.2026.112537
Soufiyan Ouali, Said El Garouani
The healthcare domain constitutes a fundamental pillar of national development, as maintaining population health not only enhances citizens' quality of life but also generates substantial economic benefits through increased productivity, innovation, and workforce participation. However, the healthcare industry faces numerous challenges and barriers that impede universal access to medical services. In low- and middle-income countries, significant portions of the population forgo medical consultations due to various socioeconomic constraints, including prohibitive consultation fees, scheduling difficulties, and extended waiting periods. Consequently, there is an urgent need for innovative approaches to optimize healthcare delivery processes. Recent advances in artificial intelligence have demonstrated promising potential in developing intelligent systems that address healthcare accessibility gaps. These innovations include medical chatbots, appointment booking systems, disease-prediction models, and psychiatric virtual assistants. However, such technological enhancements have predominantly focused on high-resource languages, while research in low-resource languages, particularly Arabic, remains in its preliminary stages. This disparity is especially pronounced in Arabic dialects, which differ substantially from Modern Standard Arabic in terms of vocabulary, syntax, and semantic structures. To address this critical gap, we present the first comprehensive dataset for the Moroccan Arabic dialect in the healthcare domain. The MedQA-MA dataset comprises 108,943 question-answer pairs in text format, with each pair categorized according to medical specialty. Covering 23 distinct medical specialties, the dataset supports multiple applications, including sentiment analysis, specialty classification, question-answering systems, and the development of human-like medical chatbots. 
The dataset has been meticulously curated, annotated, and validated by qualified medical professionals, ensuring its reliability and clinical relevance for developing realistic healthcare systems grounded in authentic medical interactions.
The MedQA-MA dataset is publicly available and freely accessible at https://data.mendeley.com/datasets/v6gs7nsy9z/1, representing a significant contribution to Arabic Natural Language Processing research in healthcare applications and facilitating the development of culturally and linguistically appropriate medical AI systems for Arabic-speaking populations.
Title: MedQA-MA: A Moroccan Arabic medical question-answering dataset for virtual healthcare assistants and large language models; Journal: Data in Brief, Volume 65, Article 112537
Pub Date: 2026-04-01; Epub Date: 2026-01-08; DOI: 10.1016/j.dib.2026.112450
Xinchao Song, Mingjun Li, Sean Banerjee, Natasha Kholgade Banerjee
We present the HILO dataset consisting of high-resolution 3D scanned models for 253 common-use objects and 32,256 multi-viewpoint RGB-D images with typically low-resolution data for 144 tabletop scenes consisting of collections of random sets of 10 objects drawn from the set of 253 objects. The dataset provides the 6-degree-of-freedom (6DOF) pose for all objects found in each of the 32,256 RGB-D images, obtained by performing precise 3D alignment of the 3D models to the RGB-D images. The dataset also contains metadata on object mass, a short text descriptor, binning into everyday-use classes, and aspect ratio and function categories, as well as intrinsic parameters for the RGB-D sensors used in capture and transformations between camera poses. Object 3D models in the dataset were acquired using a tabletop 3D scanner, and were manually inspected, cleaned, repaired, and exported as original ultra-high-resolution meshes at ∼1M vertices and simplified high-resolution meshes at ∼10k vertices. To capture the multi-view RGB-D images, we established an in-house testbed consisting of a turntable and two robotic manipulators that cover azimuth and elevation angles, respectively, and together span a hemisphere. Images were captured using two Microsoft Azure Kinect sensors, one mounted at the wrist of each robot. We captured images at two distances, forming hemispherical shells. We used in-house software written in Python to control the turntable movement, robot motion, and image capture, as well as to perform camera calibration, processing to generate registered images and foreground masks, manual precise alignment of object models to images, and post-capture correction of misalignments in camera transformation parameters. 
The dataset provides value in enabling training and evaluation of algorithms for several tasks in computer vision, artificial intelligence (AI), and robotics such as object completion, recognition, segmentation, high-resolution structure generation, robotic grasp planning, and recognition of human-preferred grasp locations for human-robot collaboration.
Title: Dataset of RGB-D images of object collections from multiple viewpoints with aligned high-resolution 3D models of objects; Journal: Data in Brief, Volume 65, Article 112450
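A 6DOF pose combines a 3D rotation and a 3D translation, and applying it to model vertices places the object in the camera frame. The sketch below illustrates that operation only; it does not reflect the dataset's file format, and it restricts the rotation to a single yaw angle about the z-axis for brevity (a general pose uses a full 3 × 3 rotation matrix). All names and values are assumptions for illustration.

```python
import math


def pose_from_yaw(yaw_rad, translation):
    """Build (R, t) for a rotation about the z-axis plus a translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    rotation = [
        [c, -s, 0.0],
        [s, c, 0.0],
        [0.0, 0.0, 1.0],
    ]
    return rotation, list(translation)


def apply_pose(rotation, translation, points):
    """Map each 3D point p to R @ p + t."""
    return [
        tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]


# Hypothetical model vertices in the object frame.
vertices = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
R, t = pose_from_yaw(math.pi / 2, (0.0, 0.0, 1.0))
posed = apply_pose(R, t, vertices)
print(posed)
```

Rotating (1, 0, 0) by 90 degrees about z gives approximately (0, 1, 0), and the translation then lifts it by one unit along z.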
Pub Date: 2026-04-01; Epub Date: 2026-01-08; DOI: 10.1016/j.dib.2026.112456
Marie-Liesse Vermeire, Pathé Basse, Samuel Legros, Falilou Diallo, Anne Desnues, Frédéric Feder
Recycling the growing stock of organic waste products (OWP) from cities, factories, and farms is a key challenge for sustainable agriculture. However, it must be done with awareness of both performance and potential long-term environmental and health risks. In this context, the SOERE PRO observatory was established ("Systèmes d'Observation et d'Expérimentation pour la Recherche en Environnement - Produits Résiduaires Organiques", a label granted by the French National Research Alliance for the Environment (AllEnvi) to recognize high-quality research infrastructures, which translates to "Long-term Observation and Experimentation Systems for Environmental Research - Organic Waste Products"), including the trial in Sangalkam, in the Dakar region of Senegal, where these data are collected. Since 2016, four fertilizer types - one mineral (synthetic) and three organic - have been applied annually to three successive vegetable crops (tomato, lettuce, carrot). The dataset currently covers the period 2016-2025, with data collection ongoing and new data to be added in the future. Manual weeding and hoeing are carried out regularly for each crop; no pesticides are used for crop protection in the trial. A comprehensive, multi-variable dataset is consistently documented, including soil physico-chemical parameters measured annually at three depths, organic waste product characterization, crop yield and quality parameters, and detailed management activities, making it particularly suitable for process-based modelling and long-term impact assessment. The originality of this dataset lies in its long duration, the diversity of organic and mineral fertilization strategies, the inclusion of multiple vegetable crops per year, and its location under Sub-Sahelian conditions, a context for which long-term agronomic datasets remain scarce. All soil, OWP, and vegetable samples are stored in a sample bank in Dakar and are available for additional analyses. 
The objective of this dataset is to provide long-term, integrated information on crop productivity, crop quality, and soil responses to repeated organic and mineral fertilization in a Sub-Sahelian market-gardening system. The dataset is publicly available through a Dataverse repository for free (re)use in meta-analyses, process-based modelling, and environmental studies, notably to improve understanding of nutrient cycling, contaminant dynamics, soil biodiversity, and long-term soil functioning in Sub-Sahelian agroecosystems, and to support sustainable land management and food security in Southern countries under future climate change.
Title: Soil and crop data from a long-term organic fertilization trial in Sub-Sahelian market gardening; Journal: Data in Brief, Volume 65, Article 112456
Pub Date: 2026-04-01; Epub Date: 2026-01-24; DOI: 10.1016/j.dib.2026.112498
Paulina Deptula, Jenni Sihvola, Pekka Varmanen
This dataset reports the complete genome sequence of Propionibacterium freudenreichii strain J117, a food-grade bacterium isolated from Austrian Vorarlberger Bergkäs cheese. The strain was selected for its application in a co-fermentation platform aimed at enhancing vitamin B12 content in plant-based fermented foods. Genomic DNA was extracted from anaerobic cultures grown in yeast extract lactate (YEL) broth and sequenced using PacBio Sequel II long-read technology with SMRT Cell 8M. High-fidelity (HiFi) reads were generated, and circular consensus sequences (CCS) were assembled using the Improved Phased Assembler (IPA v2).
Genome annotation was performed with Bakta v1.10.4. Antibiotic resistance screening was carried out using the Resistance Gene Identifier (RGI v6.0.3) from the Comprehensive Antibiotic Resistance Database (CARD) via the PROKSEE platform. No plasmid-encoded resistance determinants were identified. The genome comprises two circular replicons and includes full annotation of coding sequences, RNAs, a CRISPR array, and pseudogenes.
The raw sequencing data, genome assembly files, and annotation outputs are included in the associated data repository, organized in subfolders for raw reads, assemblies, and analysis results. This dataset supports the related research article: Zhang, R., Chen, L., Zhang, D., Sihvola, J., Chamlagain, B., Olin, M., Piironen, V., & Varmanen, P. Innovative co-fermentation of Propionibacterium freudenreichii and Rhizopus oryzae enhances vitamin B12, riboflavin, and flavor profile components in sweet fermented glutinous rice. Food Chemistry, 503 (2026).
The availability of this genome provides a reference for comparative genomic analysis, functional pathway prediction, and strain development. It also facilitates safety assessment of food-related strains, such as the absence of mobile antibiotic resistance genes, thereby supporting the transparent use of J117 in fermented food applications.
Title: Genome data of Propionibacterium freudenreichii J117, a functional strain from raw-milk cheese; Journal: Data in Brief, Volume 65, Article 112498
Pub Date: 2026-04-01 Epub Date: 2026-01-27 DOI: 10.1016/j.dib.2026.112519
Mychael Maoeretz Engel , Ford Lumban Gaol , Aditya Kurniawan , Widodo Budiharto
The dataset presented in this article comprises anonymous transactional records and associated product metadata collected from a Food and Beverages (F&B) Micro-Small-Medium Enterprise (MSME) operating in a city in Indonesia. It is intended for researchers, data scientists, and industry professionals applying recommender system and machine learning techniques. The data were acquired by passively logging sales events through the business's internal Point-of-Sale (POS) system from January 2025 to September 2025. The raw data, initially containing a comprehensive transaction log and a detailed product catalog, underwent a cleaning and structuring protocol. This protocol removed seven unneeded features, including blank customer details and duplicate monetary entries, and standardized the formatting of product descriptions. The dataset contains no Customer IDs, so individual customers cannot be identified. An RFM (recency, frequency, monetary) analysis of 1000 transactions, grouped by session context, produced 14 pseudo-profiles that served as stable behavioral indicators for unidentifiable users. The final data package consists of two relational tables: the Transactions Table, with 11 core features (including Outlet, Date, Time, and Total Amount), and the Products Metadata Table, with definitions for 96 individual products. The dataset supports studies in retail analytics and recommender systems. Specifically, it supports the development and benchmarking of algorithms for session-based recommendation and the creation of user segmentation models in the anonymous, data-sparse environments typical of the MSME retail sector.
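The RFM analysis over session-context groups mentioned above can be illustrated with a minimal sketch. It is not the authors' procedure: the group keys, the row layout (group key, date, total amount), the sample values, and the reference date are all hypothetical, and the actual column names in the Transactions Table may differ.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (session_group_key, transaction_date, total_amount).
# These keys and values are illustrative only, not taken from the dataset.
transactions = [
    ("outlet_A|morning", date(2025, 1, 5), 45000.0),
    ("outlet_A|morning", date(2025, 3, 2), 30000.0),
    ("outlet_B|evening", date(2025, 9, 20), 120000.0),
]


def rfm_by_group(rows, reference_date):
    """Return {group: (recency_days, frequency, monetary)} for each group key."""
    last_seen = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for key, day, amount in rows:
        # Recency is measured from the most recent transaction in the group.
        if key not in last_seen or day > last_seen[key]:
            last_seen[key] = day
        frequency[key] += 1      # number of transactions in the group
        monetary[key] += amount  # total spend in the group
    return {
        key: ((reference_date - last_seen[key]).days, frequency[key], monetary[key])
        for key in last_seen
    }


# Assumed reference date: the end of the stated collection window.
profiles = rfm_by_group(transactions, reference_date=date(2025, 9, 30))
```

Each resulting tuple could then be binned or clustered to form pseudo-profiles; how the 14 profiles in the dataset were derived is not specified in the abstract.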
{"title":"Anonymous transactional dataset in a local food and beverages (F&B) micro-small-medium enterprise (MSME) for recommender systems","authors":"Mychael Maoeretz Engel , Ford Lumban Gaol , Aditya Kurniawan , Widodo Budiharto","doi":"10.1016/j.dib.2026.112519","DOIUrl":"10.1016/j.dib.2026.112519","url":null,"abstract":"<div><div>The dataset presented in this article comprises anonymous transactional records and associated product metadata collected from a local Food and Beverages (F&B) Micro-Small-Medium Enterprise (MSME) operating in a local city in Indonesia. This data can be used by researchers, data scientists, and industry professionals using various techniques in recommender system and machine learning. The data acquisition process involved the passive logging of sales events through the business's internal Point-of-Sale (POS) system from January 2025 to September 2025. The raw data, initially containing a comprehensive transaction log and a detailed product catalog, underwent a cleaning and structuring protocol. The process needed the elimination of seven unneeded features which included blank customer details and duplicate monetary entries and uniform product description formatting. The dataset lacks any distinctive customer identification numbers because it does not contain Customer IDs. The RFM analysis of 1000 transactions through session context grouping produced 14 pseudo-profiles which showed stability as behavioral indicators for unidentifiable users. The final data package consists of two relational tables which include the Transactions Table with 11 core features (Outlet, Date, Time and Total Amount) and the Products Metadata Table with definitions for 96 individual products. The available data in the dataset allows researchers to perform studies about retail analytics and recommender systems. 
Specifically, it supports the development and benchmarking of algorithms designed for session-based recommendation and the creation of user segmentation models in anonymous, data-sparse environments typical of the MSME retail sector.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"65 ","pages":"Article 112519"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146184681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-04-01 Epub Date: 2026-01-10 DOI: 10.1016/j.dib.2026.112459
Luiza N. Loges , Ricardo DeMoya , Valentina Laverde, Saulius Sumanas
Fli1b is an ETS transcription factor previously implicated in zebrafish vascular and hematopoietic development. Here we present single-cell RNA sequencing data from wild-type and maternal-zygotic fli1b mutant zebrafish embryos at 24 h post fertilization. Single-cell suspensions were obtained from approximately 40 whole maternal-zygotic (MZ) fli1b mutant and sibling parent wild-type embryos and subjected to RNA sequencing using the 10X Genomics Chromium platform. Following bioinformatic analysis, 34 distinct cell clusters were identified in the integrated wild-type and fli1b mutant dataset. The clusters were subsequently annotated based on the expression of marker genes. These data will be valuable for further studies of the molecular mechanisms involved in vascular and hematopoietic development. In addition, the transcriptomes obtained for multiple cell types will be useful for investigating other developmental mechanisms in zebrafish and other models.
{"title":"Single-cell RNA-seq data of wild type and fli1b mutant zebrafish embryos","authors":"Luiza N. Loges , Ricardo DeMoya , Valentina Laverde, Saulius Sumanas","doi":"10.1016/j.dib.2026.112459","DOIUrl":"10.1016/j.dib.2026.112459","url":null,"abstract":"<div><div>Fli1b is an ETS transcription factor, which has been previously implicated in zebrafish vascular and hematopoietic development. Here we present single cell RNA sequencing data from wild-type and maternal zygotic <em>fli1b</em> mutant zebrafish embryos at 24 h post fertilization. Single-cell suspensions were obtained from approximately 40 whole maternal-zygotic (MZ) <em>fli1b</em> mutant and sibling parent wild-type embryos and subjected to RNA sequencing using the 10X Genomics Chromium platform. Following bioinformatic analysis, 34 distinct cell clusters were identified in the integrated wild-type and <em>fli1b</em> mutant dataset. The clusters were subsequently annotated based on expression of marker genes. These data will be valuable for further studies of the molecular mechanisms involved in vascular and hematopoietic development. In addition, the obtained transcriptomes of multiple cell types will be useful to investigate other developmental mechanisms in zebrafish and other models.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"65 ","pages":"Article 112459"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146036158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-04-01 Epub Date: 2026-01-14 DOI: 10.1016/j.dib.2026.112472
Leandro da Silva Gomes , Gustavo Henrique de Queiroz Stabile , Tahisa Neitzel Kuck , Felipe Augusto Pereira de Figueiredo , Elcio Hideiti Shiguemori , Dimas Irion Alves
The Brazilian Amazon Rainforest is of great ecological and economic importance and is considered one of the most biodiverse regions on the planet. The region faces numerous challenges from illegal human activities that threaten its sustainability and well-being, and these activities are often supported by the construction of unauthorized airstrips. Because persistent cloud cover often hinders monitoring with optical satellites, Synthetic Aperture Radar (SAR) imagery provides a crucial alternative for surveillance of the region. This dataset was therefore developed to support the training and evaluation of machine learning techniques, including deep learning models, for detecting and segmenting airstrips in the Brazilian Amazon Rainforest using SAR imagery. The dataset comprises images from the Sentinel-1 satellite, acquired primarily between 2021 and 2024, covering 1040 locations of known airstrips sourced from the MapBiomas project (published in 2023, based on 2021 reference data). For the change detection task, historical "before" images were selected from the period between 2014 and 2021 to capture the pre-construction state. The data are structured to support three distinct machine learning tasks: object detection (e.g., YOLOv8), semantic segmentation (e.g., U-Net), and change detection. For each task, specific images and annotations are provided. Additionally, geospatial files (Shapefile, GeoPackage) are included to facilitate the integration and visualization of the dataset in a GIS environment. The data are valuable for researchers in remote sensing, computer vision, environmental monitoring, and security and defense, enabling the development of automated systems to monitor irregular activities in remote forest regions. The dataset is available at a Mendeley Data repository: https://data.mendeley.com/datasets/x7rn78ymtn/1
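For the object detection task, YOLO-family models commonly read one text line per object, holding a class index followed by the box center, width, and height normalized by the image size. The sketch below converts a pixel-space box into such a line. It assumes this label convention applies; the abstract does not state the dataset's actual annotation format, and the class index, box coordinates, and 512 × 512 image size are hypothetical.

```python
def to_yolo_label(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box to a YOLO-style label line.

    Output fields: class index, then center x, center y, width, and height,
    each normalized to [0, 1] by the image dimensions.
    """
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical airstrip box in an assumed 512 x 512 image tile.
line = to_yolo_label(0, 100, 200, 300, 264, img_w=512, img_h=512)
print(line)
```

If the repository already ships labels in this form, the function is unnecessary; it is mainly useful when deriving boxes from the segmentation masks or the geospatial files.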
{"title":"A Sentinel-1 SAR imagery dataset for airstrips detection and segmentation in the Brazilian Amazon Rainforest","authors":"Leandro da Silva Gomes , Gustavo Henrique de Queiroz Stabile , Tahisa Neitzel Kuck , Felipe Augusto Pereira de Figueiredo , Elcio Hideiti Shiguemori , Dimas Irion Alves","doi":"10.1016/j.dib.2026.112472","DOIUrl":"10.1016/j.dib.2026.112472","url":null,"abstract":"<div><div>The Brazilian Amazon Rainforest holds a large ecological and economic importance and is considered one of the most biodiverse regions on the planet. The region faces numerous challenges from illegal human activities that threaten its sustainability and well-being, which are often supported by the construction of unauthorized airstrips. Additionally, due to its persistent cloud cover, which often hinders monitoring with optical satellites, Synthetic Aperture Radar (SAR) imagery provides a crucial alternative for the region surveillance. Thus, this dataset was developed to support the training and evaluation of machine learning techniques, including deep learning models for detecting and segmenting airstrips in the Brazilian Amazon Rainforest using SAR imagery. The dataset comprises images from the Sentinel-1 satellite, acquired primarily between 2021 and 2024, covering 1040 locations of known airstrips sourced from the MapBiomas project (published in 2023, based on 2021 reference data). For the change detection task, historical “before” images were selected from the period between 2014 and 2021 to capture the pre-construction state. The data is structured to support three distinct machine learning tasks: object detection (e.g., YOLOv8), semantic segmentation (e.g., U-Net), and change detection. For each task, specific images and annotations are provided. Additionally, geospatial files (Shapefile, GeoPackage) are included to facilitate the integration and visualization of the dataset in a GIS environment. 
The data is valuable for researchers in remote sensing, computer vision, environmental monitoring, security and defense, enabling the development of automated systems to monitor irregular activities in remote forest regions. The dataset is available at a Mendeley Data repository: <span><span>https://data.mendeley.com/datasets/x7rn78ymtn/1</span><svg><path></path></svg></span></div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"65 ","pages":"Article 112472"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146036171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}