Pub Date: 2026-04-01 | Epub Date: 2026-01-24 | DOI: 10.1016/j.dib.2026.112505
Mega Laksmini Syamsuddin, Umar Abdurrahman, Ajeng Riska Puspita, Sunarto, Qurnia Wulan Sari, Fadli Syamsudin, Indrawan Fadhil Pratyaksa, Iqbal Maulana Cipta, Ivonne Milichristi Radjawane, Hansan Park
This article presents a high-resolution UAV–LiDAR dataset acquired over the main coastal tourism hotspots of Pangandaran, West Java, Indonesia (WGS84 / UTM Zone 49S). The survey was conducted using a DJI Matrice 300 RTK equipped with a CHCNAV AA450 LiDAR system at altitudes of 77–83 m AGL, following grid-based flight lines with 80% forward and 70% side overlap. The final point cloud, delivered in LAS format, exhibits a mean density of approximately 865 pts/m², with dominant values of 600–800 pts/m² across roads, roofs, and open terrain, and localized peaks exceeding 3,000 pts/m² in areas of flight-line overlap. Ground control was established using three static base stations, with 14 calibration control points and 8 independent validation check points. Accuracy assessment yields RMSE values of 0.072 m (Easting), 0.062 m (Northing), and 0.138 m (Elevation), with corresponding mean biases of 0.017 m, 0.017 m, and 0.044 m, confirming centimeter-level positional precision suitable for detailed coastal mapping. The dataset includes DSM and DTM derivatives, block-based tiles, metadata, and processing reports, supporting its use in tsunami exposure assessment, climate-risk valuation, urban coastal planning, and remote-sensing education. As one of the first openly accessible UAV–LiDAR datasets for an Indonesian coastal tourism hotspot, it provides a reproducible, high-density 3D resource for research, hazard analysis, and sustainable coastal development.
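The accuracy figures quoted in the abstract (RMSE and mean bias per axis) are conventionally derived from residuals at independent check points. The following is a minimal sketch of that calculation, not the authors' processing code: the function name `rmse_and_bias` and the sample elevations are hypothetical and are not taken from the dataset.

```python
import math


def rmse_and_bias(measured, reference):
    """Return (RMSE, mean bias) of the residuals measured - reference."""
    if len(measured) != len(reference) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [m - r for m, r in zip(measured, reference)]
    # RMSE: square root of the mean squared residual.
    rmse = math.sqrt(sum(d * d for d in residuals) / len(residuals))
    # Mean bias: signed average residual (systematic offset).
    bias = sum(residuals) / len(residuals)
    return rmse, bias


# Hypothetical elevations in metres at three check points (illustration only).
lidar_z = [10.10, 12.00, 8.90]   # values read from the point cloud
gnss_z = [10.00, 12.00, 9.00]    # surveyed reference values
rmse_z, bias_z = rmse_and_bias(lidar_z, gnss_z)
print(f"RMSE = {rmse_z:.3f} m, bias = {bias_z:.3f} m")
```

Running the same function separately on Easting, Northing, and Elevation residuals would give one RMSE and bias pair per axis, which is the form in which the abstract reports them.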
Title: UAV-LiDAR dataset of Pangandaran coastal tourism hotspots for tsunami and climate risk valuation and exposure mapping. Data in Brief, vol. 65, Article 112505.
Pub Date: 2026-04-01 | Epub Date: 2026-01-29 | DOI: 10.1016/j.dib.2026.112511
Yuan Ting Chong, Yessane Shrrie Nagendhra Rao, Rehman Ullah Khan, Chee Siong Teh, Mohamad Hardyman Barawi, Mohd Shahrizal Sunar, Joan Jo Jo Sim
Sign languages around the world are unique and diverse. Each sign language reflects the cultural nuances of its place of origin, giving it its distinctive nature. Thus, despite the positive outcomes of the sign language recognition and translation research conducted worldwide, each system still has notable limitations, mainly caused by limited data. Sign language recognition and translation research in Malaysia in particular has been set back by the limited size and nature of the available datasets that are compatible with current technological developments. The datasets currently available for Malaysian Sign Language (BIM – Bahasa Isyarat Malaysia) are small and limited to fingerspelling of alphanumeric characters and several dynamic words and short phrases. However, given the continuous nature of sign language communication, these data are not enough to properly train machine learning models to recognize and translate continuous real-world signs. To address this issue, we introduce a dynamic BIM dataset which comprises video, gloss, and translation data covering alphanumeric characters, dynamic words and short phrases, and continuous sentences. The dataset is split into two versions. The first version, BIM-SSD-V1, comprises 4,858 parallel video (RGB frames), gloss, and translation data, while the second version, BIM-SSD-V2, comprises 3,143 parallel video (RGB frames), keypoints, and gloss data for recognition purposes, and 4,900 parallel gloss and translation data for translation purposes. The raw videos are also available in the dataset. The dataset was developed and compiled with the help of the Deaf and Hard-of-Hearing community. This process also included the development of a Sign Language Module (translations for the video and gloss data) to assist in the development of the dataset. The image and video data were collected using smartphones, and the corresponding gloss annotations were prepared with the help of a BIM expert. The data collection process was designed to reflect everyday communication scenarios by incorporating varied sentence constructions, repeated signing instances, and recordings under different backgrounds and contextual conditions to introduce data-level variability relevant to real-world use. Four participants were involved in the data collection process, and there are four samples for every character, word, phrase, or sentence in the Sign Language Module. The dataset can mainly be reused by researchers who would like to conduct sign language recognition and translation research using the Sign-to-Gloss-to-Text framework. However, the dataset is not limited to one framework and can be used with other sign language recognition and translation research frameworks.
Title: A dynamic Malaysian sign language dataset for sign language recognition and translation. Data in Brief, vol. 65, Article 112511.
Pub Date: 2026-04-01 | Epub Date: 2026-01-02 | DOI: 10.1016/j.dib.2025.112429
Soha Zarbah, Arwa Wali, Dimah Alahmadi
The legal sector remains distinctive due to the complex language structure and specialized terminology of legal data. This complexity carries considerable contextual information, which calls for natural language processing (NLP). The availability of high-quality, well-structured legal datasets is essential for advancing NLP research and applications within the legal field. However, a gap exists in Arabic legal NLP owing to insufficient research and datasets. To address this gap, we propose an Arabic legal case dataset containing cases, case summaries, relevant keywords, and case categories. The legal case data were obtained from the Board of Grievances website in Saudi Arabia and include 3170 cases distributed across 47 classes. Case length varies significantly, ranging from about 100 to nearly 30,000 words and from one page to 80 pages per case. This dataset supports various NLP applications, including text categorization, data extraction, sentiment analysis, and summarization, thereby improving task efficiency and decision accuracy in the legal profession.
Title: Legal case documents: A comprehensive dataset for Arabic natural language processing research and applications. Data in Brief, vol. 65, Article 112429.
Pub Date: 2026-04-01 | Epub Date: 2026-01-08 | DOI: 10.1016/j.dib.2026.112454
Bidyut R. Mohapatra, Linel S. Moralez, Kiya E. James
This study reports the whole-genome sequence data and functional annotations of a novel Stutzerimonas marianensis strain LB-0542 isolated from decomposing pelagic Sargassum biomass stranded on Long Beach, Barbados. The genomic DNA was sequenced on the Illumina NextSeq2000 platform, and the genome was assembled with the SPAdes Genome Assembler (ver 3.15.5). The assembled genome has a size of 4,520,813 bp, a coverage of 110X, a GC content of 63.2 %, an L50 of 2 and an N50 of 1,079,143 bp. The genome consists of 12 contigs, 0 CRISPR, 3 rRNA, 56 tRNA and 4166 CDSs (coding sequences) with a coding ratio of 89.4 %. The genome annotation results for the COG (cluster of orthologous genes) and subsystem features indicate that metabolism and amino acids and derivatives are the most dominant categories, respectively. The analysis of the genome for Carbohydrate-Active Enzymes (CAZymes) identified 230 genes encoding four functional classes of CAZymes [glycoside hydrolases (75 genes), glycosyltransferases (95 genes), carbohydrate esterases (9 genes) and carbohydrate-binding modules (51 genes)]. The functional annotation of the genome for plastic degradation revealed the presence of 34 genes, which could catalyse the degradation of 14 types of plastics: polyethylene glycol [PEG (29 %)], polylactic acid [PLA (11 %)], poly(3-hydroxybutyrate-co-3-hydroxyvalerate) [PHBV (9 %)], polyhydroxyalkanoates [PHA (9 %)], polyethylene [PE (6 %)], polycaprolactone [PCL (6 %)], polyethersulfone [PES (6 %)], polyethylene terephthalate [PET (6 %)], poly(butylene adipate-co-terephthalate) [PBAT (3 %)], polystyrene [PS (3 %)], polybutylene succinate [PBSA (3 %)], poly(3-hydroxyvalerate) [P3HV (3 %)], polyvinyl alcohol [PVA (3 %)] and natural rubber [NR (3 %)]. The genome mining for plant growth-promoting traits identified 3175 genes that are associated with colonizing the plant system (26 %), competitive exclusion (21 %), stress control (21 %), biofertilization (14 %), phytohormone and plant signal production (10 %), bioremediation (7 %) and plant immune response stimulation (1 %). These genome mining results indicate the biotechnological and ecological significance of the novel strain LB-0542 for sustainable biocatalytic processing of Sargassum and plastic-containing waste. The genome sequence data is available in DDBJ/EMBL/GenBank with the accession number BAAIAE000000000.
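The assembly statistics N50 and L50 reported in the abstract follow a standard definition: with contigs sorted from longest to shortest, N50 is the length of the contig at which the running total first reaches half of the assembly size, and L50 is the number of contigs needed to get there. The sketch below illustrates that definition only; the function name `n50_l50` and the contig lengths are hypothetical and are not the LB-0542 contigs.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        # The first contig that pushes the running sum to half the
        # assembly size defines N50; its rank is L50.
        if running >= half_total:
            return length, count


# Hypothetical contig lengths (bp), for illustration only.
example_contigs = [500, 400, 300, 200, 100]
n50, l50 = n50_l50(example_contigs)
print(n50, l50)
```

In this toy assembly the total is 1500 bp, so the half-way mark of 750 bp is crossed by the second-longest contig.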
Title: Genome data mining of a novel Stutzerimonas marianensis strain LB-0542 isolated from pelagic Sargassum seaweed waste for plastic-degrading and plant growth-promoting traits. Data in Brief, vol. 65, Article 112454.
Pub Date: 2026-04-01 | Epub Date: 2026-01-15 | DOI: 10.1016/j.dib.2026.112478
Olga Esteban-Sinovas, Rosario Sarabia, Ignacio Arrese, Vikas Singh, Prakash Shett, Aliasgar Moiyadi, Ilyess Zemmoura, Massimiliano Del Bene, Arianna Barbotti, Francesco DiMeco, Timothy Richard West, Brian Vala Nahed, Giuseppe Roberto Giammalva, Santiago Cepeda
The BraTioUS (Brain Tumor Intraoperative Ultrasound) dataset [1] is a large-scale, multicenter, and publicly available collection of intraoperative ultrasound (ioUS) images acquired during glioma surgeries. Created through an international collaboration among six hospitals across five countries, BraTioUS comprises 1669 B-mode 2D ioUS images from 142 glioma patients collected between 2018 and 2023 using various ultrasound systems and acquisition protocols. It also includes expert-supervised tumor segmentation masks for every ioUS image.
BraTioUS addresses several limitations found in existing public datasets, such as lack of diversity in acquisition hardware, imaging protocols, and glioma types. The primary objective of this dataset is to be publicly available and accessible for the training and validation of machine learning models aimed at improving the interpretation and use of ioUS. The dataset’s scale, quality, and heterogeneity make it a valuable resource for training and validating AI tools aimed at improving intraoperative decision-making and patient outcomes in glioma surgery.
Title: BraTioUS: A multicenter dataset of baseline intraoperative brain tumor ultrasound images. Data in Brief, vol. 65, Article 112478.
This dataset provides a comprehensive, multidimensional phytochemical characterization of Lekhaniya Mahakashaya (LMK), a classical Ayurvedic formulation used for the treatment of obesity and metabolic disorders. Three complementary analytical platforms were employed: High-Resolution Liquid Chromatography-Mass Spectrometry/Mass Spectrometry (HRLC-MS/MS) Orbitrap, High-Performance Thin Layer Chromatography (HPTLC), and Fourier-Transform Infrared (FTIR) spectroscopy. For HRLC-MS/MS analysis, hydroalcoholic extracts of LMK were prepared and analysed in both positive and negative ionisation modes using an Orbitrap mass spectrometer. The dataset includes 2034 metabolomics-identified compounds: 1712 in positive ion mode and 322 in negative ion mode, with detailed retention times, molecular weights, and fragmentation patterns, suitable for compound annotation, metabolite networking, and cheminformatics-based correlation studies. HPTLC fingerprinting was performed using methanolic extracts (2–10 µL) on silica gel 60 F₂₅₄ plates, which yielded 7–8 reproducible peaks across the Rf range 0.12–0.89 under 254 nm, 366 nm, and 540 nm, confirming LMK’s polyherbal complexity. Marker-based quantification, performed using validated HPTLC protocols, revealed berberine (0.24 % w/w) and curcumin (0.31 % w/w); calibration curves are included for reproducibility. FTIR spectroscopic data encompass 19 absorption peaks (3278–468 cm⁻¹), representing hydroxyl, aliphatic, unsaturated, sulfur-, nitrogen-, and halogen-containing functional groups, which highlights LMK’s diverse phytochemical matrix. This dataset is structured as a reference resource for pharmacological exploration, quality control, and phytochemical standardisation of LMK and associated Ayurvedic formulations. Additionally, the dataset can be used for molecular docking validation, network pharmacology mapping, metabolomics comparisons, and future drug discovery. To promote transparency, encourage computational or experimental reuse, and support integrative research on traditional medicine, all raw chromatograms, spectrum files, and processed data tables are made available in widely accessible formats.
Title: Multi-analytical dataset on Lekhaniya Mahakashaya: HRLC-MS/MS Orbitrap profiling, HPTLC fingerprinting with marker estimation, and FTIR spectroscopy. Authors: Narayan Singh, Anjali Upadhyay, Debajyoti Chakraborty, Girimalla Patil, Pramod Yadav, Galib R, Pradeep Kumar Prajapati. DOI: 10.1016/j.dib.2026.112464. Data in Brief, vol. 65, Article 112464. Pub Date: 2026-04-01.
Pub Date: 2026-04-01 | Epub Date: 2026-02-07 | DOI: 10.1016/j.dib.2026.112551
Younes Ouargani, Noussaima El Khattabi
Video datasets are crucial for advancing communication technologies for deaf and hard-of-hearing individuals. Despite that, extensive datasets are not available for the majority of sign languages due to the ample work required to capture, clean, organize, and publish them. This paper introduces the Video Dataset for Algerian Sign Language (VDzSL), the largest video dataset for Algerian Sign Language. To ensure demographic diversity, VDzSL utilizes four different avatars to animate the signs and records them from five distinct camera angles, employing polar coordinates to ensure consistency while capturing varying horizontal and vertical perspectives. With 415 signs, our dataset has a 99.5% coverage of the signs included in 3DZSignDB’s SiGML dataset, and 26.6% coverage of the official ALGSL dictionary provided by the Algerian Ministry of Solidarity. Our dataset contains 8300 video files totaling 3 h, 11 min, and 43 s of synthetic videos provided at a 498×498 pixel resolution and an average frame rate of 27 frames per second across the entire dataset. The dataset is primarily aimed at training, testing, and benchmarking machine learning models, facilitating transfer learning and comparative analyses, as well as developing learning tools and accessibility applications.
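The counts in the abstract can be cross-checked with simple arithmetic. The sketch below assumes, as the abstract's numbers suggest but do not state outright, that every sign is rendered once per avatar and per camera angle; the variable names are illustrative.

```python
# Figures quoted in the abstract.
signs = 415
avatars = 4
camera_angles = 5
reported_files = 8300
frame_rate = 27  # average frames per second

# Assumption: one video per (sign, avatar, camera angle) combination.
expected_files = signs * avatars * camera_angles

# Total duration of 3 h, 11 min, 43 s expressed in seconds.
total_seconds = 3 * 3600 + 11 * 60 + 43

# Derived averages per video file.
mean_clip_seconds = total_seconds / reported_files
mean_clip_frames = mean_clip_seconds * frame_rate

print(expected_files == reported_files)
print(f"{mean_clip_seconds:.2f} s per clip, about {mean_clip_frames:.0f} frames")
```

Under that assumption the product matches the reported file count, and the average clip is a little under 1.4 s long, which is consistent with isolated-sign videos rather than continuous sentences.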
{"title":"VDzSL: A synthetic video dataset for Algerian sign language using 3D avatars","authors":"Younes Ouargani, Noussaima El Khattabi","doi":"10.1016/j.dib.2026.112551","journal":{"name":"Data in Brief","volume":"65","pages":"Article 112551"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"146185079"}
Pub Date: 2026-04-01; Epub Date: 2026-01-29; DOI: 10.1016/j.dib.2026.112516
Jhovany Quintana-Vera, Ana I. González-Tablas, Mohammed Rashed
The use of smart home devices is on the rise: the number of users is estimated to grow from the current ~361 million to over 785 million, a 117% increase in just 4 years. It is therefore essential to have a dataset that details the different aspects of the devices available on the market. In this paper, we introduce the Spanish MArket Smart Home devices (SMASH) dataset, which we collected via structured data extraction from four major Spanish e-commerce platforms. Containing 5218 devices across 652 brands, the dataset provides an overview of smart home devices sold within Spain, the fourth largest economy in the European Union. For each device it records the name, price, brand, model, rating, number of reviews, platform, and category. The dataset can be used as a primary source in research on consumer behaviour and microeconomics. Additionally, these details could be used to create new datasets, such as collections of the privacy policies of brands and of the mobile applications (apps) used with the devices. The dataset is publicly accessible under the CC-BY-NC-4.0-ES license. We note, however, that SMASH is limited to products sold within Spain and collected within a specific time window (start date: 2023-12; end date: 2024-08); users should consider these scope and temporal constraints when generalizing findings.
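As a quick sanity check on the headline numbers, the following sketch recomputes the quoted growth rate and derives the average number of devices per brand. The variable names are illustrative, and the per-brand average is a derived figure that the abstract does not report.

```python
# User counts quoted in the abstract, in millions.
current_users_m = 361
projected_users_m = 785

# Relative increase from the current to the projected user base.
increase_pct = (projected_users_m - current_users_m) / current_users_m * 100

# Dataset size quoted in the abstract.
devices = 5218
brands = 652

# Derived figure (not stated in the abstract): mean devices per brand.
mean_devices_per_brand = devices / brands

print("increase (%):", round(increase_pct, 1))
print("mean devices per brand:", round(mean_devices_per_brand, 2))
```

The computed increase of roughly 117.5% is consistent with the 117% stated in the abstract, and the dataset averages about eight devices per brand.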
{"title":"A dataset of smart home devices sold on Spanish e-commerce platforms","authors":"Jhovany Quintana-Vera, Ana I. González-Tablas, Mohammed Rashed","doi":"10.1016/j.dib.2026.112516","journal":{"name":"Data in Brief","volume":"65","pages":"Article 112516"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"146185210"}
Pub Date: 2026-04-01; Epub Date: 2026-02-02; DOI: 10.1016/j.dib.2026.112541
Roger Chiu-Coutino, Miguel S. Soriano-Garcia, Carlos Israel Medel-Ruiz, S.M. Afanador-Delgado, Edgar Villafaña-Rauda, Roger Chiu
This data article presents an experimental dataset of scattered images obtained using a low-cost, open-source, Raspberry Pi-based optical system. Each data sample consists of two grayscale images of 256 × 256 pixels: (i) the scattered image and (ii) the original projected pattern, which serves as ground truth. The system projects diverse patterns through various optical diffusers with different scattering coefficients and physical thicknesses. The patterns include geometric shapes, digits, and textures to increase variability and support generalization. This variety allows the analysis of distinct scattering regimes and the evaluation of image recovery models under varying optical complexities. The dataset supports deep learning research on inverse problems in optics and is particularly useful for training and benchmarking image restoration models in scattering environments.
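Benchmarking an image recovery model on paired samples like these usually comes down to comparing a restored image with its ground-truth pattern. The sketch below shows one common choice, the peak signal-to-noise ratio (PSNR), for 8-bit grayscale images held as nested lists. The metric is an assumption, since the abstract does not name an evaluation protocol, and the two images are synthetic placeholders rather than dataset samples.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized 2D grayscale images."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same shape")
    n_pixels = sum(len(row) for row in a)
    total = sum((pa - pb) ** 2
                for ra, rb in zip(a, b)
                for pa, pb in zip(ra, rb))
    return total / n_pixels


def psnr(a, b, max_value=255):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    if err == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / err)


SIZE = 256  # image side length stated in the abstract

# Placeholder ground truth: a bright square on a black background.
ground_truth = [[255 if 96 <= r < 160 and 96 <= c < 160 else 0
                 for c in range(SIZE)] for r in range(SIZE)]

# Placeholder "scattered" image: a uniform gray field (no structure left).
scattered = [[16] * SIZE for _ in range(SIZE)]

print("PSNR scattered vs. ground truth (dB):",
      round(psnr(scattered, ground_truth), 2))
print("PSNR ground truth vs. itself:", psnr(ground_truth, ground_truth))
```

A restoration model would be scored by replacing `scattered` with its output; higher PSNR against the projected pattern indicates a closer reconstruction.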
{"title":"Dataset of scattered images using noncoherent light under varying diffusion conditions and projected patterns","authors":"Roger Chiu-Coutino, Miguel S. Soriano-Garcia, Carlos Israel Medel-Ruiz, S.M. Afanador-Delgado, Edgar Villafaña-Rauda, Roger Chiu","doi":"10.1016/j.dib.2026.112541","journal":{"name":"Data in Brief","volume":"65","pages":"Article 112541"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"146185211"}
This dataset presents experimental data on the performance of a photovoltaic (PV) solar-powered water pumping system installed in a coffee plantation in Chiang Mai province, Thailand. The system performance was evaluated through controlled experiments using response surface methodology (RSM). Three independent variables were systematically varied: solar irradiance (300–900 W/m²), panel inclination (15–35°), and panel surface temperature (30–60°C). A total of 15 experimental runs were conducted, and the pumping efficiency (%) was recorded under each condition. Statistical analyses, including analysis of variance (ANOVA) and regression modeling, were applied to evaluate the effects of the individual variables and their interactions on system performance. The dataset includes raw and processed measurements, regression coefficients, and response surface parameters, enabling replication and further analysis. Perturbation plots, 3D surface plots, and contour plots provide detailed visualizations of the relationships between environmental factors and system efficiency. The optimal operating conditions were identified at a solar irradiance of 600 W/m², a panel inclination of 25°, and a panel surface temperature of 45°C, corresponding to a predicted maximum efficiency of 76.3–77.0%.
This dataset can be reused for designing optimized solar water pumping systems, validating predictive models, and comparing system performance under different environmental conditions or geographic locations. It also serves as a reference for researchers in renewable energy system optimization and agricultural water management. The data provide high-resolution, experimentally validated information on the combined effects of solar irradiance, panel inclination, and panel surface temperature on PV water pumping efficiency. Unlike previous studies, the dataset includes detailed quantitative analysis specific to coffee-growing regions in Northern Thailand, along with regression models and visualizations that can guide both experimental replication and predictive modeling under similar climatic and agricultural conditions.
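The abstract does not name the experimental design behind the 15 runs. One common three-factor RSM design with exactly that run count is a Box-Behnken design with three center points, and the sketch below generates it purely as an illustration under that assumption. The factor names and the `box_behnken` and `decode` helpers are hypothetical; only the factor ranges come from the abstract.

```python
from itertools import combinations, product

# Factor ranges quoted in the abstract: (low, high).
FACTORS = {
    "irradiance_W_m2": (300, 900),
    "inclination_deg": (15, 35),
    "temperature_C": (30, 60),
}


def box_behnken(n_factors, n_center):
    """Coded Box-Behnken runs: each pair of factors at +/-1, the rest at 0."""
    runs = []
    for i, j in combinations(range(n_factors), 2):
        for a, b in product((-1, 1), repeat=2):
            run = [0] * n_factors
            run[i], run[j] = a, b
            runs.append(tuple(run))
    # Replicated center points (assumed count, not stated in the abstract).
    runs.extend([(0,) * n_factors] * n_center)
    return runs


def decode(coded_run):
    """Map coded levels (-1, 0, +1) to physical factor values."""
    values = {}
    for code, (name, (low, high)) in zip(coded_run, FACTORS.items()):
        center = (low + high) / 2
        half_range = (high - low) / 2
        values[name] = center + code * half_range
    return values


design = box_behnken(n_factors=3, n_center=3)
print("number of runs:", len(design))
for run in design:
    print(run, decode(run))
```

In this coding the center point decodes to 600 W/m², 25°, and 45°C, the midpoints of the three ranges, which coincide with the optimal conditions reported in the abstract.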
{"title":"Dataset on the performance of a photovoltaic solar water pump in coffee plantations using response surface methodology (RSM)","authors":"Nopparat Suriyachai, Torpong Kreetachat, Saksit Imman","doi":"10.1016/j.dib.2026.112467","journal":{"name":"Data in Brief","volume":"65","pages":"Article 112467"},"PeriodicalIF":1.4,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"146036170"}