As smart buildings have proliferated, so have the risks of cyber-attacks against them. BACnet is one of the most popular and actively evolving protocols for communication between devices in smart buildings, especially HVAC systems. Machine learning algorithms and neural networks need datasets of normal traffic and real attacks to build intrusion detection systems (IDS) and intrusion prevention systems (IPS) that can detect anomalies and prevent attacks. Real traffic datasets for these networks are often unavailable for confidentiality reasons. To address this, we propose a framework that converts existing real datasets into BACnet protocol network traffic with detailed network behaviour. In this method, a virtual machine is prepared for each controller based on real scenarios, and a controller simulator running on that virtual machine injects data previously collected under real conditions into the network, preserving the original date and time during the simulation. We performed three types of attacks on the network: falsifying, modifying, and covert channel attacks. For covert channel attacks, the message was modelled in three forms: plain text, hashed using SHA3-256, and encrypted using AES-256. Network traffic was recorded with Wireshark in pcap format. Because the generated dataset is built from real data, its data behaviour aligns with real conditions.
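The first two covert-channel message forms named in the BACnet abstract (plain text and SHA3-256 hash) can be illustrated with a minimal Python sketch. The function name `encode_covert_message` and the `mode` strings are hypothetical, and nothing here reflects how the authors actually embedded messages in BACnet frames. The AES-256 form is left out because the Python standard library does not provide AES.

```python
import hashlib


def encode_covert_message(message: str, mode: str = "plain") -> bytes:
    """Return the payload bytes for a covert message (illustrative only).

    mode="plain"    -> UTF-8 bytes of the message
    mode="sha3-256" -> 32-byte SHA3-256 digest of the message
    """
    data = message.encode("utf-8")
    if mode == "plain":
        return data
    if mode == "sha3-256":
        return hashlib.sha3_256(data).digest()
    raise ValueError(f"unsupported mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("plain", "sha3-256"):
        payload = encode_covert_message("setpoint override", mode)
        print(mode, len(payload), payload.hex())
```

A SHA3-256 digest is always 32 bytes long, so the hashed form has a fixed payload size regardless of message length, whereas the plain-text form grows with the message.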
{"title":"Developing a comprehensive BACnet attack dataset: A step towards improved cybersecurity in building automation systems.","authors":"Seyed Amirhossein Moosavi, Mojtaba Asgari, Seyed Reza Kamel","doi":"10.1016/j.dib.2024.111192","DOIUrl":"10.1016/j.dib.2024.111192","url":null,"abstract":"<p><p>With the development of smart buildings, the risks of cyber-attacks against them have also increased. One of the popular and evolving protocols used for communication between devices in smart buildings, especially HVAC systems, is the BACnet protocol. Machine learning algorithms and neural networks require datasets of normal traffic and real attacks to develop intrusion detection (IDS) and prevention (IPS) systems that can detect anomalies and prevent attacks. Real traffic datasets for these networks are often unavailable due to confidentiality reasons. To address this, we propose a framework that uses existing real datasets and converts them into BACnet protocol network traffic with detailed network behaviour. In this method, a virtual machine is prepared for each controller based on real scenarios, and by creating a simulator for the controller on the virtual machine, real data previously collected under real conditions from existing datasets is injected into the network with the same date and time during the simulation. We performed three types of attacks, including Falsifying, Modifying, and covert channel attacks on the network. For covert channel attacks, the message was modelled in three forms: Plain text, hashed using SHA3-256, and encrypted using AES-256. Network traffic was recorded using Wireshark software in pcap format. 
The advantage of the generated dataset is that since we used real data, the data behaviour aligns with real conditions.</p>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"57 ","pages":"111192"},"PeriodicalIF":1.0,"publicationDate":"2024-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11683266/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142906613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-03; eCollection Date: 2024-12-01; DOI: 10.1016/j.dib.2024.111194
Lawrence McKnight, Chandra Jaiswal, Issa AlHmoud, Balakrishna Gokaraju
Effective data representation is paramount in machine learning and deep learning. For an algorithm or neural network to capture patterns in data and make reliable predictions, the data must appropriately describe the problem domain. Although much has been written on data preprocessing for machine learning and data science applications, novel data representation methods for enhancing machine learning model performance remain largely absent from the literature. This dataset is a compilation of the performance of convolutional neural network models trained and tested on a wide range of numerical base representations of the MNIST and MNIST-C datasets. The research community can analyse this performance data further to uncover trends in model performance against the numerical base of the data. The dataset can also support further research of the same nature, testing cross-base data encoding on machine learning training and testing data for a wide range of real-world applications.
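To make "numerical base representation" concrete, the sketch below converts an 8-bit pixel intensity (0 to 255, as in MNIST) into a fixed-width list of digits in an arbitrary base. This is only an assumed reading of cross-base encoding; the abstract does not describe the exact scheme, and the helper names `to_base` and `encode_pixel` are hypothetical.

```python
def to_base(value: int, base: int) -> list[int]:
    """Return the digits of a non-negative integer in the given base,
    most significant digit first."""
    if base < 2:
        raise ValueError("base must be at least 2")
    if value < 0:
        raise ValueError("value must be non-negative")
    if value == 0:
        return [0]
    digits = []
    while value:
        value, remainder = divmod(value, base)
        digits.append(remainder)
    return digits[::-1]


def encode_pixel(value: int, base: int, max_value: int = 255) -> list[int]:
    """Encode one pixel as a fixed-width digit list, left-padded with zeros
    so every pixel uses as many digits as max_value needs in this base."""
    width = len(to_base(max_value, base))
    digits = to_base(value, base)
    return [0] * (width - len(digits)) + digits


if __name__ == "__main__":
    for base in (2, 3, 16):
        print(base, encode_pixel(10, base))
```

Fixed-width padding keeps the encoded input shape constant for a given base, which a convolutional network would need; whether the dataset's authors padded in this way is not stated.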
{"title":"A dataset of deep learning performance from cross-base data encoding on MNIST and MNIST-C.","authors":"Lawrence McKnight, Chandra Jaiswal, Issa AlHmoud, Balakrishna Gokaraju","doi":"10.1016/j.dib.2024.111194","DOIUrl":"https://doi.org/10.1016/j.dib.2024.111194","url":null,"abstract":"<p><p>Effective data representation in machine learning and deep learning is paramount. For an algorithm or neural network to capture patterns in data and be able to make reliable predictions, the data must appropriately describe the problem domain. Although there exists much literature on data preprocessing for machine learning and data science applications, novel data representation methods for enhancing machine learning model performance remain highly absent within the literature. This dataset is a compilation of convolutional neural network model performance trained and tested on a wide range of numerical base representations of the MNIST and MNIST-C datasets. This performance data can be further analysed by the research community to uncover trends in model performance against the numerical base of its data. This dataset can be used to produce more research of the same nature, testing cross-base data encoding on machine learning training and testing data for a wide range of real-world applications.</p>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"57 ","pages":"111194"},"PeriodicalIF":1.0,"publicationDate":"2024-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11697575/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142930887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-03; eCollection Date: 2024-12-01; DOI: 10.1016/j.dib.2024.111193
Quan Huu Nguyen, Trinh Van Nguyen, Thuy Thi Xuan Vi, Thuy Thi Thu Vu, Lan Thi Ngoc Nguyen, Yen Thi Hai Nguyen, Hung Duc Nguyen, Tan Quang Tu, Mau Hoang Chu
Species of the Boehmeria genus have potential as natural medicines and are used in industrial fibre production. Many species of this genus are morphologically similar and difficult to distinguish, especially when their morphology is distorted. This dataset includes sequence information for several DNA regions isolated from the genome of Boehmeria holosericea, namely ITS (from the nuclear genome) and matK, trnL-trnF, trnH-psbA, and rpoC1 (from the chloroplast genome), together with phylogenetic analysis results based on the isolated sequences. In the phylogenetic tree based on the matK gene sequence, B. holosericea was grouped with B. umbrosa, B. clidemioides, B. spicata, and B. macrophylla (a bootstrap coefficient of 100%). In the phylogenetic tree based on the trnH-psbA spacer region sequences, B. holosericea was grouped with B. clidemioides (a bootstrap coefficient of 96%). In the phylogenetic tree based on the rpoC1 gene sequences, B. holosericea was grouped with B. spicata (a bootstrap coefficient of 100%). In the phylogenetic tree based on the ITS region sequences, B. holosericea was grouped with B. macrophylla (a bootstrap coefficient of 73%), and based on the trnL-trnF spacer region, B. holosericea was grouped with B. pilociuscula (a bootstrap coefficient of 16%). Two genes, matK and rpoC1, and the trnH-psbA region from the chloroplast genome are potential DNA barcode candidates that could aid in the species identification of B. holosericea. This dataset is the first report on the ITS, matK, trnL-trnF, trnH-psbA, and rpoC1 sequences and the phylogeny of B. holosericea.
{"title":"Dataset on ITS and some chloroplast DNA regions of <i>Boehmeria holosericea</i> Blume in Vietnam.","authors":"Quan Huu Nguyen, Trinh Van Nguyen, Thuy Thi Xuan Vi, Thuy Thi Thu Vu, Lan Thi Ngoc Nguyen, Yen Thi Hai Nguyen, Hung Duc Nguyen, Tan Quang Tu, Mau Hoang Chu","doi":"10.1016/j.dib.2024.111193","DOIUrl":"10.1016/j.dib.2024.111193","url":null,"abstract":"<p><p>Species of the <i>Boehmeria</i> genus have the potential to be natural medicines and have industrial fibre production uses. Many species of this genus are morphologically similar and are difficult to distinguish, especially when their morphology is distorted. This dataset includes sequence information of several DNA regions isolated from the genome of <i>Boehmeria holosericea</i>, namely ITS (from the nuclear genome), <i>matK</i>, trnL-trnF, trnH-psbA, and <i>rpoC1</i> (from the chloroplast genome) and phylogenetic analysis results based on the isolated sequences. On the phylogenetic tree based on the matK gene sequence, B. holosericea is grouped with <i>B. umbrosa, B. clidemioides, B. spicata, and B. macrophylla</i> with a bootstrap coefficient of 100%. In the phylogenetic tree based on the trnH-psbA spacer region sequences, <i>B. holosericea</i> was grouped with B. clidemioides (a bootstrap coefficient of 96%). In the phylogenetic tree based on the <i>rpoC1</i> gene sequences, <i>B. holosericea</i> was grouped with <i>B. spicata</i> (a bootstrap coefficient of 100%). In the phylogenetic tree based on the ITS region sequences, <i>B. holosericea</i> was grouped with B<i>. macrophylla</i> (a bootstrap coefficient of 73%), and based on the trnL-trnF spacer region, <i>B. holosericea</i> was grouped with <i>B. pilociuscula</i> (a bootstrap coefficient of 16%). Two genes, <i>matK</i> and <i>rpoC1</i> and the trnH-psbA region from the chloroplast genome, are potential DNA barcode candidates that could aid in the species identification of <i>B. holosericea</i>. 
This dataset the first report on the ITS, <i>matK</i>, trnL-trnF, trnH-psbA, and <i>rpoC1</i> sequences and the phylogeny of <i>B. holosericea.</i></p>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"57 ","pages":"111193"},"PeriodicalIF":1.0,"publicationDate":"2024-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11683258/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142906590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-02; eCollection Date: 2024-12-01; DOI: 10.1016/j.dib.2024.111171
Zacharias Dahl, Aleksanteri Hämäläinen, Aku Karhinen, Jesse Miettinen, Andre Böhme, Samuel Lillqvist, Sampo Haikonen, Raine Viitala
Accurate system health state prediction through deep learning requires extensive and varied data. However, real-world data scarcity poses a challenge for developing robust fault diagnosis models. This study introduces two extensive datasets, Aalto Shim Dataset and Aalto Gear Fault Dataset, collected under controlled laboratory conditions, aimed at advancing deep learning-based fault diagnosis. The datasets encompass a wide range of gear faults, including synthetic and realistic failure modes, replicated on a downsized azimuth thruster testbench equipped with multiple sensors. The data features various fault types and severities under different operating conditions. The comprehensive data collected, along with the methodologies for creating synthetic faults and replicating common gear failures, provide valuable resources for developing and testing intelligent fault diagnosis models, enhancing their generalization and robustness across diverse scenarios.
{"title":"Aalto Gear Fault datasets for deep-learning based diagnosis.","authors":"Zacharias Dahl, Aleksanteri Hämäläinen, Aku Karhinen, Jesse Miettinen, Andre Böhme, Samuel Lillqvist, Sampo Haikonen, Raine Viitala","doi":"10.1016/j.dib.2024.111171","DOIUrl":"10.1016/j.dib.2024.111171","url":null,"abstract":"<p><p>Accurate system health state prediction through deep learning requires extensive and varied data. However, real-world data scarcity poses a challenge for developing robust fault diagnosis models. This study introduces two extensive datasets, Aalto Shim Dataset and Aalto Gear Fault Dataset, collected under controlled laboratory conditions, aimed at advancing deep learning-based fault diagnosis. The datasets encompass a wide range of gear faults, including synthetic and realistic failure modes, replicated on a downsized azimuth thruster testbench equipped with multiple sensors. The data features various fault types and severities under different operating conditions. The comprehensive data collected, along with the methodologies for creating synthetic faults and replicating common gear failures, provide valuable resources for developing and testing intelligent fault diagnosis models, enhancing their generalization and robustness across diverse scenarios.</p>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"57 ","pages":"111171"},"PeriodicalIF":1.0,"publicationDate":"2024-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11683272/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142906663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-02; eCollection Date: 2024-12-01; DOI: 10.1016/j.dib.2024.111184
Rotimi-Williams Bello, Pius A Owolawi, Etienne A van Wyk, Chunling Du
Solar energy has become the fastest growing renewable and alternative source of energy. However, there are few or no open-source datasets to advance research on photovoltaic-related systems. The work presented in this article is a step towards deriving a Photo-Voltaic Module Dataset (PVMD) of thermal images and making it publicly available. The PVMD dataset comprises a total of 1000 self-acquired and augmented images and includes both permanent and temporary anomalies, namely Hotspots, Cracks, and Shadings. The dataset was collected on September 5, 2024 at the Soshanguve South Campus, Tshwane University of Technology, South Africa, using the high-resolution thermal and visual imaging capabilities of the DJI Mavic 3 Thermal. These capabilities, coupled with its advanced flight features, make the DJI Mavic 3 Thermal an excellent tool for precise and efficient inspections of PV systems. The laboratory experiment performed on the dataset lasted one week. The work aims to provide a supervised dataset good enough to support research methods that offer a comprehensive and efficient approach to monitoring and maintaining large PV systems. Extensive analysis of the thermal data reveals the anomalies as indicative of faults in the solar cells of the PV module, thereby opening up advancement in solar energy research. Because the data come from a single-day collection and a one-week laboratory experiment, they are better suited to testing algorithms designed for fault detection. The dataset is publicly and freely available to the scientific community at 10.17632/5ssmfpgrpc.1.
{"title":"Photovoltaic module dataset for automated fault detection and analysis in large photovoltaic systems using photovoltaic module fault detection.","authors":"Rotimi-Williams Bello, Pius A Owolawi, Etienne A van Wyk, Chunling Du","doi":"10.1016/j.dib.2024.111184","DOIUrl":"https://doi.org/10.1016/j.dib.2024.111184","url":null,"abstract":"<p><p>Solar energy has become the fastest growing renewable and alternative source of energy. However, there is little or no open-source datasets to advance research knowledge in photovoltaic related systems. The work presented in this article is a step towards deriving Photo-Voltaic Module Dataset (PVMD) of thermal images and ensuring they are publicly available. The work provides a PVMD dataset comprising a total of 1000 self-acquired and augmented images. The dataset includes both permanent and temporal anomalies, namely Hotspots, Cracks, and Shadings. The dataset was collected on September 5, 2024 at the Soshanguve South Campus, Tshwane University of Technology, South Africa using DJI Mavic 3 Thermal's high-resolution thermal and visual imaging capabilities. DJI Mavic 3 Thermal coupled with its advanced flight features makes it an excellent tool for precise and efficient inspections of PV systems. The laboratory experiment performed on the dataset lasted one week. The work aims to provide supervised dataset good enough to support research method in providing a comprehensive and efficient approach to monitoring and maintaining large PV systems. Extensive analysis of the thermal data reveals the anomalies as indicative of faults in the solar cells of PV module, thereby opening up advancement in solar energy research. Because the data comes from a single-day collection and one week laboratory experiment, it makes the data more suitable for testing algorithms designed for fault detection. 
The dataset is publicly and freely available to the scientific community at 10.17632/5ssmfpgrpc.1.</p>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"57 ","pages":"111184"},"PeriodicalIF":1.0,"publicationDate":"2024-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11683254/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142906689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-02; eCollection Date: 2024-12-01; DOI: 10.1016/j.dib.2024.111183
Jorge Garate-Quispe, Ramiro Canahuire-Robles, Marx Herrera-Machaca, Sufer Baez-Quispe, Gabriel Alarcón-Aguirre
Anthropogenic activities (e.g., logging, gold mining, agriculture, and uncontrolled urban expansion) threaten the forests in the southeast of the Peruvian Amazon, one of the most diverse ecosystems worldwide. Of these, gold mining generates the most severe impacts on ecosystems and limits their resilience. The natural regeneration of degraded areas in the southeastern Peruvian Amazon has not been studied in depth. The dataset contains floristic inventories of previously uncharacterized or poorly studied secondary forests degraded and abandoned by gold-mining activities, and of an intact forest, in the Tres Islas indigenous community, Madre de Dios region, in southeastern Peru. The data presented were obtained from 12 plots (20 m × 60 m) established in three successional forests abandoned by gold mining and an intact forest (without mining impacts), where all trees with a stem diameter at breast height greater than 1 cm were inventoried. To the best of our knowledge, this is the only dataset in the southwest of the Peruvian Amazon that compares natural colonization after gold mining with intact forests. This dataset can be useful for long-term study and monitoring of structure and tree diversity in relatively understudied yet important secondary forests after gold-mining abandonment. It could also be used to analyze the successional trajectory of vegetation and the recovery of aboveground biomass. Furthermore, the data could be used to investigate the effects of functional traits and types of mining on vegetation recovery. Hence, understanding the successional processes will help to improve restoration, reforestation, or reclamation strategies for the recovery of degraded lands in the Amazon.
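The sampling design stated in the gold-mining abstract (12 plots of 20 m × 60 m, trees with diameter at breast height above 1 cm) implies a sampled area that can be worked out directly. The Python sketch below does that arithmetic and adds the standard circular-stem basal-area formula as one way structural data of this kind are often summarised. The basal-area step is an assumption for illustration and is not mentioned in the abstract; all function names are hypothetical.

```python
import math

PLOT_WIDTH_M = 20.0    # from the abstract
PLOT_LENGTH_M = 60.0   # from the abstract
N_PLOTS = 12           # from the abstract
MIN_DBH_CM = 1.0       # inventory threshold from the abstract


def plot_area_ha() -> float:
    """Area of one plot in hectares (1 ha = 10,000 m^2)."""
    return PLOT_WIDTH_M * PLOT_LENGTH_M / 10_000.0


def inventoried(dbh_values_cm: list[float]) -> list[float]:
    """Keep only stems above the inventory threshold (DBH > 1 cm)."""
    return [d for d in dbh_values_cm if d > MIN_DBH_CM]


def basal_area_m2(dbh_cm: float) -> float:
    """Cross-sectional stem area in m^2, assuming a circular stem."""
    radius_m = dbh_cm / 100.0 / 2.0
    return math.pi * radius_m ** 2


if __name__ == "__main__":
    print("plot area (ha):", plot_area_ha())
    print("total sampled area (ha):", N_PLOTS * plot_area_ha())
    stems = inventoried([0.8, 1.0, 2.5, 20.0])  # made-up example values
    print("stems kept:", stems)
    print("basal area (m^2):", sum(basal_area_m2(d) for d in stems))
```

Each plot covers 1,200 m² (0.12 ha), so the 12 plots together sample about 1.44 ha.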
{"title":"Field data on diversity and vegetation structure of natural regeneration in a chronosequence of abandoned gold-mining lands in a tropical Amazon forest.","authors":"Jorge Garate-Quispe, Ramiro Canahuire-Robles, Marx Herrera-Machaca, Sufer Baez-Quispe, Gabriel Alarcón-Aguirre","doi":"10.1016/j.dib.2024.111183","DOIUrl":"10.1016/j.dib.2024.111183","url":null,"abstract":"<p><p>Anthropogenic activities (e.g., logging, gold-mining, agriculture, and uncontrolled urban expansion) threaten the forests in the southeast of the Peruvian Amazon, one of the most diverse ecosystems worldwide. However, gold-mining generates the most severe impacts on ecosystems and limits its resilience. The natural regeneration of degraded areas in the southeastern Peruvian Amazon have not been studied deeply. The dataset contains floristic inventories of previously uncharacterized or poorly studied secondary forests degraded and abandoned by goldmining activities and an intact forest in the Tres Islas indigenous community, Madre de Dios region, in southeastern Peru. The data presented was obtained from 12 plots (20 m × 60 m) established in three successional forests abandoned by gold mining and an intact forest (without mining impacts), where all trees with a stem diameter at breast height greater than 1 cm were inventoried. To the best of our knowledge, this is the only dataset in the southwest of the Peruvian Amazon that compares the natural colonization after gold-mining and intact forests. This dataset can be useful for long-term study and monitoring of structure and tree diversity in relatively understudied yet important secondary forests after gold-mining abandonment. Also, this dataset could be used to analyze the successional trajectory process of vegetation and the recovery of aboveground biomass. Furthermore, the data could be used to investigate the effects of functional traits and types of mining on vegetation recovery. 
Hence, understanding the successional processes will help to improve restoration, reforestation, or reclamation strategies for the recovery of degraded lands in the Amazon.</p>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"57 ","pages":"111183"},"PeriodicalIF":1.0,"publicationDate":"2024-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11665693/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142881783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-01; DOI: 10.1016/j.dib.2024.111125
Daniel Lee, Fernanda C C Oliveira, Richard T. Conant, Minjae Kim
Increasing atmospheric carbon dioxide (CO2) concentrations are impacting the global climate, resulting in significant interest in soil carbon sequestration as a mitigation strategy. While it is recognized that mineral-associated organic matter (MAOM) in soils is mainly formed through microbial activity, our understanding of microbial-derived MAOM formation processes remains limited due to the complexity of the soil environment. To gain insights into this issue, we incubated fresh soil samples for 45 days with one of three mineral additions: Sand, Kaolinite+Sand, or Illite+Sand. 16S rRNA V3/V4 gene amplicon sequencing was then conducted on samples using an Illumina NextSeq 2000 flow cell. The reads were analyzed and taxonomically assigned with QIIME2 v2023.5.1 and SILVA 138. The dataset has been made publicly available through NCBI GenBank under BioProject ID PRJNA1124235. The dataset provides insights into the interactions between soil minerals and microbial communities, which can inform strategies for enhancing soil carbon sequestration and mitigating climate change. It also serves as a reference for future studies, offering a foundational understanding of microbial dynamics in soil systems and guiding further research in microbial ecology and carbon cycling.
{"title":"Microbial community assembly across agricultural soil mineral mesocosms revealed by 16S rRNA gene amplicon sequencing data","authors":"Daniel Lee , Fernanda C C Oliveira , Richard T. Conant , Minjae Kim","doi":"10.1016/j.dib.2024.111125","DOIUrl":"10.1016/j.dib.2024.111125","url":null,"abstract":"<div><div>Increasing atmospheric carbon dioxide (CO<sub>2</sub>) concentrations are impacting the global climate, resulting in significant interest in soil carbon sequestration as a mitigation strategy. While recognized that mineral-associated organic matter (MAOM) in soils is mainly formed through microbial activity, our understanding of microbial-derived MAOM formation processes remains limited due to the complexity of the soil environment. To gain insights into this issue, we incubated fresh soil samples for 45 days with one of three mineral additions: Sand, Kaolinite+Sand, or Illite+Sand. 16S rRNA V3/V4 gene amplicon sequencing was then conducted on samples using an Illumina NextSeq 2000 flow cell. The reads were analyzed and taxonomically assigned with QIIME2 v2023.5.1 and SILVA 138. The dataset has been made publicly available through NCBI GenBank under BioProject ID PRJNA1124235. This dataset is important and useful as it provides valuable insights into the interactions between soil minerals and microbial communities, which can inform strategies for enhancing soil carbon sequestration and mitigating climate change. 
Moreover, it serves as a crucial reference for future studies, offering a foundational understanding of microbial dynamics in soil systems and guiding further research in microbial ecology and carbon cycling.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"57 ","pages":"Article 111125"},"PeriodicalIF":1.0,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142745172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Centella asiatica is a significant medicinal herb extensively used in traditional oriental medicine and gaining global popularity. The primary constituents of Centella asiatica leaves are triterpenoid saponins, which are predominantly believed to be responsible for its therapeutic properties. Ensuring the use of high-quality leaves in herbal medicine preparation is crucial across all medicinal practices. To address this quality control issue using machine learning applications, we have developed an image dataset of Centella asiatica leaves. The images were captured using Samsung Galaxy M21 mobile phones and depict the leaves in “Dried,” “Healthy,” and “Unhealthy” states. These states are further divided into “Single” and “Multiple” leaves categories, with “Single” leaves being further classified into “Front” and “Back” views to facilitate a comprehensive study. The images were pre-processed and standardized to 1024 × 768 dimensions, resulting in a dataset comprising a total of 9094 images. This dataset is instrumental in the development and evaluation of image recognition algorithms, serving as a foundational resource for computer vision research. Moreover, it provides a valuable platform for testing and validating algorithms in areas such as image categorization and object detection. For researchers exploring the medicinal potential of Centella asiatica in traditional medicine, this dataset offers critical information on the plantʼs health, thereby advancing research in herbal medicine and ethnopharmacology.
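The category hierarchy described in the Centella asiatica abstract (three states, each split into "Single" and "Multiple", with "Single" further split into "Front" and "Back") can be enumerated as in the Python sketch below. The path-like label format is purely hypothetical; the abstract does not say how the images are organised on disk or how the 9094 images are distributed across categories.

```python
STATES = ("Dried", "Healthy", "Unhealthy")   # from the abstract
SINGLE_VIEWS = ("Front", "Back")             # from the abstract
IMAGE_SIZE = (1024, 768)                     # standardized size from the abstract


def leaf_categories() -> list[str]:
    """List every leaf-level category implied by the abstract's hierarchy.

    The "State/Single/View" and "State/Multiple" label strings are an
    illustrative naming choice, not the dataset's actual folder layout.
    """
    categories = []
    for state in STATES:
        for view in SINGLE_VIEWS:
            categories.append(f"{state}/Single/{view}")
        categories.append(f"{state}/Multiple")
    return categories


if __name__ == "__main__":
    for name in leaf_categories():
        print(name)
```

Under this reading there are nine leaf-level categories: three per state.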
{"title":"Dataset of Centella Asiatica leaves for quality assessment and machine learning applications","authors":"Rohini Jadhav , Mayuri Molawade , Amol Bhosle , Yogesh Suryawanshi , Kailas Patil , Prawit Chumchu","doi":"10.1016/j.dib.2024.111150","DOIUrl":"10.1016/j.dib.2024.111150","url":null,"abstract":"<div><div><em>Centella asiatica</em> is a significant medicinal herb extensively used in traditional oriental medicine and gaining global popularity. The primary constituents of <em>Centella asiatica</em> leaves are triterpenoid saponins, which are predominantly believed to be responsible for its therapeutic properties. Ensuring the use of high-quality leaves in herbal medicine preparation is crucial across all medicinal practices. To address this quality control issue using machine learning applications, we have developed an image dataset of <em>Centella asiatica</em> leaves. The images were captured using Samsung Galaxy M21 mobile phones and depict the leaves in “Dried,” “Healthy,” and “Unhealthy” states. These states are further divided into “Single” and “Multiple” leaves categories, with “Single” leaves being further classified into “Front” and “Back” views to facilitate a comprehensive study. The images were pre-processed and standardized to 1024 × 768 dimensions, resulting in a dataset comprising a total of 9094 images. This dataset is instrumental in the development and evaluation of image recognition algorithms, serving as a foundational resource for computer vision research. Moreover, it provides a valuable platform for testing and validating algorithms in areas such as image categorization and object detection. 
For researchers exploring the medicinal potential of <em>Centella asiatica</em> in traditional medicine, this dataset offers critical information on the plantʼs health, thereby advancing research in herbal medicine and ethnopharmacology.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"57 ","pages":"Article 111150"},"PeriodicalIF":1.0,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142756798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The dataset was officially procured from the Bangladesh Meteorological Department, the sole government institution that monitors weather, which it does through 35 weather stations placed across the nation. It contains actual data spanning several decades, from the inception of each weather station to the present. The data have been restructured and processed into four key weather parameters (Rainfall, Temperature, Humidity, and Sunshine) and are presented in an organized and accessible format. This format facilitates use in machine-learning training and also supports applications in climate research, weather forecasting, and other statistical and machine-learning work.
{"title":"Climate data dynamics: A high-volume real world structured weather dataset","authors":"Md Zubair , Md. Nafiz Ishtiaque Mahee , Khondaker Masfiq Reza , Md. Shahidul Salim , Nasim Ahmed","doi":"10.1016/j.dib.2024.111156","DOIUrl":"10.1016/j.dib.2024.111156","url":null,"abstract":"<div><div>The dataset at hand is a unique resource, officially procured from the Bangladesh Meteorological Department, the sole government institution that diligently monitors weather through 35 strategically placed weather stations across the nation. This dataset is a treasure trove of actual data spanning several decades, from the inception of each weather station to the present. It has been meticulously restructured and processed into four (Rainfall, Temperature, Humidity, and Sunshine) key weather parameters, presented in a highly organized and accessible format. This format not only facilitates its use in the machine-learning training process but also opens up avenues for its application in climate research, weather forecasting, and a myriad of other statistical and machine-learning applications.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"57 ","pages":"Article 111156"},"PeriodicalIF":1.0,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142756799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-01 DOI: 10.1016/j.dib.2024.111132
Pranajit Kumar Das, Md. Abu Kawsar, Puspendu Biswas Paul, Md. Abdullah Al Mamun Hridoy, Md. Sanowar Hossain, Sabyasachi Niloy
There are about 33,000 different species of fish, and they are visually identified using a variety of traits, such as body size and shape, head size and shape, skin pattern, fin pattern, mouth pattern, scale pattern, and eye pattern. Identifying these fish species in the traditional manner, with the naked eye, is always difficult. Identifying and detecting fish species from images using deep learning and computer vision techniques is a challenging problem that interests researchers worldwide. Automatic fish species classification and detection has practical importance for both smart aquaculture and the fish industry. AI-powered automatic fish species recognition and sorting systems based on deep learning and computer vision are becoming a significant factor in making the aquaculture industry more productive and sustainable. However, the performance of a machine learning classifier depends greatly on the size of the image dataset and the quality of its images. This article presents BD-Freshwater-Fish, an image dataset containing 4389 images of 12 different species, captured in a natural environment using an HD mobile camera at local fish markets in the Sylhet and Jessore districts of Bangladesh. The twelve (12) data classes are: Rohu (Labeo rohita), Catla (Catla catla), Mrigal (Cirrhinus cirrhosus), Grass Carp (Ctenopharyngodon idella), Common Carp (Cyprinus carpio), Mirror Carp (Cyprinus carpio var. specularis), Black Rohu (Labeo calbasu), Silver Carp (Hypophthalmichthys molitrix), Striped Catfish (Pangasius pangasius), Nile Tilapia (Oreochromis niloticus), Long-whiskered Catfish (Sperata aor), and Freshwater Shark (Wallago attu). The number of images differs from species to species. The BD-Freshwater-Fish dataset is hosted by the Department of Computer Science and Engineering with the help of the Department of Aquaculture, Sylhet Agricultural University, Sylhet, Bangladesh.
{"title":"BD-freshwater-fish: An image dataset from Bangladesh for AI-powered automatic fish species classification and detection toward smart aquaculture","authors":"Pranajit Kumar Das , Md. Abu Kawsar , Puspendu Biswas Paul , Md. Abdullah Al Mamun Hridoy , Md. Sanowar Hossain , Sabyasachi Niloy","doi":"10.1016/j.dib.2024.111132","DOIUrl":"10.1016/j.dib.2024.111132","url":null,"abstract":"<div><div>There are about 33,000 different species of fish and they are visually identified using variety of traits, i.e., size and shape of body, head's size and shape, skin pattern, fin pattern, mouth pattern, scale pattern, and eye pattern etc. In traditional manner, identifying these fish species is always difficult with necked eye. Identification and detection of fish species from images using deep learning and computer vision based techniques is challenging topic among researchers worldwide as an interesting problem. Automatic fish species classification and detection has practical importance for both smart aquaculture and fish industry. AI powered deep learning and computer vision based automatic fish species recognition and sorting system becoming significant factor for making aquaculture industry more productive and sustainable. However, the performance of machine learning classifier greatly depends on the size of image dataset and the quality of the images in the dataset. This article demonstrate <em>BD-Freshwater-Fish</em>, an image dataset contain 4389 images of 12 different species captured in natural environment using HD mobile camera from local fish market of Sylhet and Jessore district of Bangladesh. Twelve (12) different data classes are: Rohu (<em>Labeo rohita</em>), Catla (<em>Catla catla</em>), Mrigal (<em>Cirrhinus cirrhosus</em>), Grass Carp (<em>Ctenopharyngodon idella)</em>, Common Carp (<em>Cyprinus carpio</em>), Mirror Carp (<em>Cyprinus carpio</em> var. 
specularis), Black Rohu (<em>Labeo calbasu</em>), Silver Carp (<em>Hypophthalmichthys molitrix),</em> Striped Catfish (<em>Pangasius pangasius</em>), Nile Tilapia (<em>Oreochromis niloticus</em>), Long-whiskered Catfish (<em>Sperata aor</em>), Freshwater Shark (<em>Wallago attu</em>) has been included in the dataset with a different number of images of different species. The <em>BD-Freshwater-Fish</em> dataset is hosted by Department of Computer Science and Engineering mutually with the help of the Department of Aquaculture, Sylhet Agricultural University, Sylhet, Bangladesh.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"57 ","pages":"Article 111132"},"PeriodicalIF":1.0,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142745174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}