Journal of Big Data: Latest Articles, Page 3

Crude oil price forecasting using K-means clustering and LSTM model enhanced by dense-sparse-dense strategy

IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Big Data

Pub Date : 2024-08-17 DOI: 10.1186/s40537-024-00977-8

Alireza Jahandoost, Farhad Abedinzadeh Torghabeh, Seyyed Abed Hosseini, Mahboobeh Houshmand

Crude oil is an essential energy source that affects international trade, transportation, and manufacturing, which makes it important to the economy. Forecasts of its future price affect consumer prices and energy markets, shape the development of sustainable energy, and are essential for financial planning, economic stability, and investment decisions. However, reliable forecasting remains an open problem because the price is highly volatile. Furthermore, many state-of-the-art methods rely on signal decomposition techniques, which can increase prediction time. In this paper, a model called K-means-dense-sparse-dense long short-term memory (K-means-DSD-LSTM) is proposed for crude oil price forecasting; it has three main training phases. In the first phase, the DSD-LSTM model is trained. Next, the training portion of the data is clustered using the K-means algorithm. Finally, a copy of the trained DSD-LSTM model is fine-tuned for each resulting cluster. Fine-tuning helps each copy predict its own cluster better while it still generalizes well to the whole dataset, which reduces overfitting. The proposed model is evaluated on two widely used crude oil benchmarks: West Texas Intermediate (WTI) and Brent. Empirical evaluations demonstrated the superiority of the DSD-LSTM model over the K-means-LSTM model. Furthermore, the K-means-DSD-LSTM model exhibited even stronger performance. Notably, the proposed method yielded promising results across diverse datasets, achieving competitive performance in comparison to existing methods, even without employing signal decomposition techniques.
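
The three training phases described above can be sketched in a few lines. This is only an illustration under loud assumptions: the DSD-LSTM is replaced by a hypothetical stand-in model that learns a single drift term, the dense-sparse-dense schedule itself is not implemented, and every function name (`make_windows`, `train_bias`, `fit_pipeline`, `predict`) is invented for this sketch rather than taken from the paper.

```python
def make_windows(series, width):
    """Split a series into (input window, next value) training pairs."""
    return [(series[i:i + width], series[i + width])
            for i in range(len(series) - width)]


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))


def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means with a deterministic, evenly spaced initialization."""
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def train_bias(pairs, bias=0.0, lr=0.5, epochs=50):
    """Stand-in for model training (NOT an LSTM): fit one drift term by
    gradient descent so that `last value + bias` approximates the target."""
    for _ in range(epochs):
        grad = sum((w[-1] + bias) - y for w, y in pairs) / len(pairs)
        bias -= lr * grad
    return bias


def fit_pipeline(series, width=3, k=2):
    pairs = make_windows(series, width)
    base = train_bias(pairs)                      # phase 1: train one global model
    centroids = kmeans([w for w, _ in pairs], k)  # phase 2: cluster training windows
    experts = []
    for c in range(k):                            # phase 3: fine-tune a copy per cluster
        members = [(w, y) for w, y in pairs if nearest(w, centroids) == c]
        experts.append(train_bias(members, bias=base, epochs=5) if members else base)
    return centroids, experts


def predict(window, centroids, experts):
    """Route the window to its cluster's fine-tuned copy and forecast one step."""
    return window[-1] + experts[nearest(window, centroids)]
```

Starting each per-cluster copy from the globally trained parameters and training it only briefly is what lets it specialize without discarding what was learned on the full training set.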

Citations: 0

Rs-net: Residual Sharp U-Net architecture for pavement crack segmentation and severity assessment

IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Big Data

Pub Date : 2024-08-17 DOI: 10.1186/s40537-024-00981-y

Luqman Ali, Hamad AlJassmi, Mohammed Swavaf, Wasif Khan, Fady Alnajjar

U-Net, a fully convolutional network-based image segmentation method, has demonstrated widespread adaptability in the crack segmentation task. However, combining the semantically dissimilar features of the encoder (shallow layers) and the decoder (deep layers) in the skip connections produces blurry feature maps and leads to undesirable over- or under-segmentation of target regions. Additionally, the shallow architecture of the U-Net model prevents the extraction of more discriminative information from input images. This paper proposes a Residual Sharp U-Net (RS-Net) architecture for crack segmentation and severity assessment in pavement surfaces to address these limitations. The proposed architecture uses residual blocks in the U-Net model to extract a more informative feature representation. In addition, a sharpening kernel filter is used instead of plain skip connections to generate fine-tuned encoder feature maps before they are combined with the decoder feature maps, which reduces the dissimilarity between them and smooths artifacts in the network layers during early training. The proposed architecture is also integrated with various morphological operations to assess the severity of cracks and categorize them as hairline, medium, or severe. Experimental results demonstrated that the RS-Net model has promising segmentation performance, outperforming earlier U-Net variants on testing data for crack segmentation and severity assessment, with an accuracy above 0.97.
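
To make the sharpened skip connection concrete, the sketch below applies a 3x3 sharpening kernel to a single-channel encoder feature map before stacking it with the decoder map. The kernel values (identity plus a Laplacian), the edge-replication padding, and the function names are assumptions chosen for illustration; the abstract does not state which kernel or padding RS-Net uses.

```python
# Assumed 3x3 sharpening kernel (identity plus a Laplacian); hypothetical choice.
SHARPEN = [[0, -1, 0],
           [-1, 5, -1],
           [0, -1, 0]]


def sharpen(feature_map, kernel=SHARPEN):
    """Apply `kernel` to a 2D feature map, replicating edge values as padding."""
    h, w = len(feature_map), len(feature_map[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index to the border
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index to the border
                    acc += kernel[dy + 1][dx + 1] * feature_map[yy][xx]
            out[y][x] = acc
    return out


def skip_connection(encoder_map, decoder_map):
    """Sharpen the encoder features, then stack them with the decoder features
    (a stand-in for channel-wise concatenation)."""
    return [sharpen(encoder_map), decoder_map]
```

Because the kernel weights sum to 1, flat regions pass through unchanged while intensity steps are exaggerated, which is the property that makes thin crack boundaries stand out.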

Citations: 0

Internet of things and ensemble learning-based mental and physical fatigue monitoring for smart construction sites

IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Big Data

Pub Date : 2024-08-16 DOI: 10.1186/s40537-024-00978-7

Bubryur Kim, K. R. Sri Preethaa, Sujeen Song, R. R. Lukacs, Jinwoo An, Zengshun Chen, Euijung An, Sungho Kim

The construction industry substantially contributes to the economic growth of a country. However, it records a large number of workplace injuries and fatalities annually due to its hesitant adoption of automated safety monitoring systems. To address this critical concern, this study presents a real-time monitoring approach that uses the Internet of Things and ensemble learning. This study leverages wearable sensor technology, such as photoplethysmography and electroencephalography sensors, to continuously track the physiological parameters of construction workers. The sensor data is processed using an ensemble learning approach called the ChronoEnsemble Fatigue Analysis System (CEFAS), comprising deep autoregressive and temporal fusion transformer models, to accurately predict potential physical and mental fatigue. Comprehensive evaluation metrics, including mean square error, mean absolute scaled error, and symmetric mean absolute percentage error, demonstrated the superior prediction accuracy and reliability of the proposed model compared to standalone models. The ensemble learning model exhibited remarkable precision in predicting physical and mental fatigue, as evidenced by the mean square errors of 0.0008 and 0.0033, respectively. The proposed model promptly recognizes potential hazards and irregularities, considerably enhancing worker safety and reducing on-site risks.
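
For reference, the three error metrics named in the abstract can be computed as follows. The definitions are the standard ones; the function names and the choice of a one-step naive forecast as the MASE baseline are assumptions of this sketch, since the abstract does not specify the variant used.

```python
def mse(actual, predicted):
    """Mean square error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mase(actual, predicted, train):
    """Mean absolute scaled error: forecast MAE divided by the in-sample MAE
    of a one-step naive forecast on the training series (assumed baseline)."""
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    naive = sum(abs(train[i] - train[i - 1])
                for i in range(1, len(train))) / (len(train) - 1)
    return mae / naive


def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent (0 to 200).
    Undefined when an actual value and its prediction are both zero."""
    total = sum(2 * abs(a - p) / (abs(a) + abs(p))
                for a, p in zip(actual, predicted))
    return 100 * total / len(actual)
```

A MASE below 1 means the forecast beats the naive baseline on average, which makes it easier to compare across signals with different scales than raw MSE.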

Citations: 0

Toward a globally lunar calendar: a machine learning-driven approach for crescent moon visibility prediction

IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Big Data

Pub Date : 2024-08-12 DOI: 10.1186/s40537-024-00979-6

Samia Loucif, Murad Al-Rajab, Raed Abu Zitar, Mahmoud Rezk

This paper presents a comprehensive approach to harmonizing lunar calendars across different global regions, addressing the long-standing challenge of variations in new crescent Moon sightings that mark the beginning of lunar months. We propose a machine learning (ML)-based framework to predict the visibility of the new crescent Moon, representing a significant advancement toward a globally unified lunar calendar. Our study utilized a dataset covering various countries globally, making it the first to analyze all 12 lunar months over a span of 13 years. We applied a wide array of ML algorithms and techniques, including feature selection, hyperparameter tuning, ensemble learning, and region-based clustering, all aimed at maximizing the model’s performance. The overall results reveal that the gradient boosting (GB) model surpasses all other models, achieving the highest F1 score of 0.882469 and an area under the curve (AUC) of 0.901009. However, with selected features identified through the ANOVA F-test and optimized parameters, the Extra Trees model exhibited the best performance, with an F1 score of 0.887872 and an AUC of 0.906242. We expanded our analysis to explore ensemble models, aiming to understand how a combination of models might boost predictive accuracy. The Ensemble Model exhibited a slight improvement, with an F1 score of 0.888058 and an AUC of 0.907482. Additionally, the geographical segmentation of the dataset enhanced predictive performance in certain areas, such as Africa and Asia. In conclusion, ML techniques can provide an efficient and reliable tool for predicting new crescent Moon visibility, supporting decisions on when new lunar months begin.
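
The ANOVA F-test used for feature selection scores each feature by how strongly its mean differs between classes relative to its spread within classes. The sketch below computes the one-way F statistic and ranks features by it; the function names and toy layout (rows of numeric features plus a label list) are assumptions for illustration, not the authors' code.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of sample groups.
    Raises ZeroDivisionError if there is no within-group variation."""
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def rank_features(rows, labels):
    """Return feature indices sorted by descending F statistic."""
    classes = sorted(set(labels))
    scores = []
    for j in range(len(rows[0])):
        # Split feature j's values by class label (e.g. visible / not visible).
        groups = [[r[j] for r, y in zip(rows, labels) if y == c] for c in classes]
        scores.append(anova_f(groups))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

Keeping only the top-ranked indices gives the reduced feature set that a downstream classifier would then be tuned on.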

Citations: 0

Enhancing K-nearest neighbor algorithm: a comprehensive review and performance analysis of modifications

IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Big Data

Pub Date : 2024-08-11 DOI: 10.1186/s40537-024-00973-y

Rajib Kumar Halder, Mohammed Nasir Uddin, Md. Ashraf Uddin, Sunil Aryal, Ansam Khraisat

The k-Nearest Neighbors (kNN) method, established in 1951, has since evolved into a pivotal tool in data mining, recommendation systems, and Internet of Things (IoT), among other areas. This paper presents a comprehensive review and performance analysis of modifications made to enhance the exact kNN techniques, particularly focusing on kNN Search and kNN Join for high-dimensional data. We delve deep into 31 kNN search methods and 12 kNN join methods, providing a methodological overview and analytical insight into each, emphasizing their strengths, limitations, and applicability. An important feature of our study is the provision of the source code for each of the kNN methods discussed, fostering ease of experimentation and comparative analysis for readers. Motivated by the rising significance of kNN in high-dimensional spaces and a recognized gap in comprehensive surveys on exact kNN techniques, our work seeks to bridge this gap. Additionally, we outline existing challenges and present potential directions for future research in the domain of kNN techniques, offering a holistic guide that amalgamates, compares, and dissects existing methodologies in a coherent manner.
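
As a baseline for the exact methods the review surveys, the brute-force versions of kNN search and kNN join are sketched below. This is the naive reference approach, not any of the 31 search or 12 join methods covered in the paper, and the function names are chosen for this example.

```python
import heapq
from collections import Counter


def knn_search(query, points, k):
    """Exact kNN search: indices of the k points nearest to `query`,
    closest first, by scanning every point (O(n) distance computations)."""
    def dist2(p):
        return sum((a - b) ** 2 for a, b in zip(query, p))
    return heapq.nsmallest(k, range(len(points)), key=lambda i: dist2(points[i]))


def knn_join(queries, points, k):
    """Exact kNN join: the kNN search result for every point in `queries`."""
    return [knn_search(q, points, k) for q in queries]


def knn_classify(query, points, labels, k):
    """Majority vote over the labels of the k nearest neighbors."""
    votes = Counter(labels[i] for i in knn_search(query, points, k))
    return votes.most_common(1)[0][0]
```

The join costs one full scan per query point, which is the quadratic behavior that indexing and pruning techniques for high-dimensional data aim to avoid.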

Citations: 0

Analysis of Graeco-Latin square designs in the presence of uncertain data

IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Big Data

Pub Date : 2024-08-07 DOI: 10.1186/s40537-024-00970-1

Abdulrahman AlAita, Muhammad Aslam, Khaled Al Sultan, Muhammad Saleem

Objective

This paper addresses the Graeco-Latin square design (GLSD) under neutrosophic statistics. In this work, we propose a novel approach for analyzing Graeco-Latin square designs using uncertain observations.

Method

This approach involves the determination of a neutrosophic ANOVA and the determination of the neutrosophic hypotheses and decision rule.

Results

The performance of the proposed design is evaluated using numerical examples and a simulation study.

Conclusion

Based on the results observed, it can be concluded that the GLSD under neutrosophic statistics performs better than the GLSD under classical statistics in the presence of uncertainty.
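
For readers unfamiliar with the layout being analyzed, the sketch below builds a Graeco-Latin square, that is, two superimposed Latin squares in which every (Latin, Greek) symbol pair occurs exactly once. It covers only the classical design layout, not the neutrosophic ANOVA; the modular construction shown works for odd orders only, and the function names are hypothetical.

```python
def graeco_latin_square(n):
    """Build an n x n Graeco-Latin square as a grid of (latin, greek) pairs.
    Uses latin = (i + j) mod n and greek = (2i + j) mod n, valid for odd n."""
    if n % 2 == 0:
        raise ValueError("this simple construction needs an odd order n")
    return [[((i + j) % n, (2 * i + j) % n) for j in range(n)] for i in range(n)]


def is_graeco_latin(square):
    """Check both layers are Latin squares and that all symbol pairs are distinct."""
    n = len(square)
    symbols = set(range(n))
    for layer in (0, 1):
        for i in range(n):
            if {square[i][j][layer] for j in range(n)} != symbols:  # row i
                return False
            if {square[j][i][layer] for j in range(n)} != symbols:  # column i
                return False
    pairs = {cell for row in square for cell in row}
    return len(pairs) == n * n  # orthogonality: no (latin, greek) pair repeats
```

In an experiment, rows and columns index two blocking factors, while the Latin and Greek symbols index the treatment and a third blocking factor.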

Citations: 0

Memetic multilabel feature selection using pruned refinement process

IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Big Data

Pub Date : 2024-08-06 DOI: 10.1186/s40537-024-00961-2

Wangduk Seo, Jaegyun Park, Sanghyuck Lee, A-Seong Moon, Dae-Won Kim, Jaesung Lee

With the growing complexity of data structures, which include high-dimensional and multilabel datasets, the significance of feature selection has become more emphasized. Multilabel feature selection endeavors to identify a subset of features that concurrently exhibit relevance across multiple labels. Owing to the impracticality of performing exhaustive searches to obtain the optimal feature subset, conventional approaches in multilabel feature selection often resort to a heuristic search process. In this context, memetic multilabel feature selection has received considerable attention because of its superior search capability; the fitness of the feature subset created by the stochastic search is further enhanced through a refinement process predicated on the employed multilabel feature filter. Thus, it is imperative to employ an effective refinement process that frequently succeeds in improving the target feature subset to maximize the benefits of hybridization. However, the refinement process in conventional memetic multilabel feature selection often overlooks potential biases in feature scores and compatibility issues between the multilabel feature filter and the subsequent learner. Consequently, conventional methods may not effectively identify the optimal feature subset in complex multilabel datasets. In this study, we propose a new memetic multilabel feature selection method that addresses these limitations by incorporating the pruning of features and labels into the refinement process. The effectiveness of the proposed method was demonstrated through experiments on 14 multilabel datasets.
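
The interplay between stochastic search and filter-based refinement can be illustrated with a generic memetic loop. This sketch does not implement the proposed pruning of features and labels: the fitness function (a plain sum of filter scores standing in for the learner's multilabel performance), the single-swap refinement, and all names are assumptions made for the example.

```python
import random


def refine(subset, filter_scores):
    """Local refinement: swap the lowest-scored selected feature for the
    highest-scored unselected one, but only if that raises the filter score."""
    selected = sorted(subset, key=lambda f: filter_scores[f])
    unselected = sorted((f for f in range(len(filter_scores)) if f not in subset),
                        key=lambda f: filter_scores[f], reverse=True)
    if selected and unselected and \
            filter_scores[unselected[0]] > filter_scores[selected[0]]:
        return (set(subset) - {selected[0]}) | {unselected[0]}
    return set(subset)


def memetic_search(filter_scores, size, generations=30, seed=0):
    """Stochastic search over fixed-size feature subsets with refinement."""
    rng = random.Random(seed)
    features = list(range(len(filter_scores)))

    def fitness(s):  # hypothetical stand-in for evaluating a multilabel learner
        return sum(filter_scores[f] for f in s)

    best = set(rng.sample(features, size))
    for _ in range(generations):
        child = set(best)
        child.remove(rng.choice(sorted(child)))                 # mutate: drop one
        child.add(rng.choice([f for f in features if f not in child]))  # add one
        child = refine(child, filter_scores)                    # memetic step
        if fitness(child) >= fitness(best):
            best = child
    return best
```

With a real learner as the fitness function, the filter scores and the learner can disagree, which is the compatibility issue the abstract says conventional refinement overlooks.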

Citations: 0

Unlocking the potential of Naive Bayes for spatio temporal classification: a novel approach to feature expansion

IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Big Data

Pub Date : 2024-08-05 DOI: 10.1186/s40537-024-00958-x

Sri Suryani Prasetiyowati, Yuliant Sibaroni

Prediction in areas ranging from climate and disease spread to disasters and air pollution relies heavily on spatio-temporal data. Understanding and forecasting the distribution patterns of disease cases and climate change phenomena has become a focal point for researchers around the world. Machine learning models for prediction can generally be classified into two groups: those based on previous patterns, such as LSTM, and those based on causal factors, such as Naive Bayes and other classifiers. The main drawback of models such as Naive Bayes is that they cannot predict future trends, because they only make predictions for the present time. In this study, we propose a novel approach that makes the Naive Bayes classifier capable of predicting future classes. The dimension of the feature matrix is expanded using historical data from several previous time periods to obtain a long-term classification prediction model based on Naive Bayes. The case studies are the prediction of the distribution of the annual number of dengue fever cases in Bandung City and the distribution of monthly rainfall on Java Island, Indonesia. Through rigorous testing, we demonstrate the effectiveness of this Time-Based Feature Expansion approach in Naive Bayes in accurately predicting the distribution of annual dengue fever cases in 30 sub-districts of Bandung City and monthly rainfall on Java Island, Indonesia, with both accuracy and F1-score exceeding 97%.
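
One plausible reading of the time-based feature expansion is sketched below: each training row concatenates the feature vectors of several earlier periods and is paired with the class observed at a later period, so a classifier trained on these rows predicts ahead in time. The function name, the `lags` and `horizon` parameters, and the exact row layout are assumptions; the abstract does not give these details.

```python
def expand_features(history, labels, lags, horizon=1):
    """Build a lag-expanded training set.

    history[t] is the feature list observed at period t and labels[t] the
    class at period t. The row for target period t concatenates the features
    of periods t - horizon, t - horizon - 1, ..., t - horizon - lags + 1.
    """
    X, y = [], []
    for t in range(horizon + lags - 1, len(history)):
        row = []
        for lag in range(lags):
            row.extend(history[t - horizon - lag])  # most recent period first
        X.append(row)
        y.append(labels[t])  # class to be predicted `horizon` periods ahead
    return X, y
```

The resulting `X` and `y` can be passed to any standard classifier, including Naive Bayes, without changing the classifier itself; only the feature matrix grows wider.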

Citations: 0

Sentiment-based predictive models for online purchases in the era of marketing 5.0: a systematic review

IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Big Data

Pub Date : 2024-08-05 DOI: 10.1186/s40537-024-00947-0

Veerajay Gooljar, Tomayess Issa, Sarita Hardin-Ramanan, Bilal Abu-Salih

The convergence of artificial intelligence (AI), big data (DB), and Internet of Things (IoT) in Society 5.0 has given rise to Marketing 5.0, revolutionizing personalized customer experiences. In this study, a systematic literature review was conducted to examine the integration of predictive modelling and sentiment analysis within the Marketing 5.0 domain. Unlike previous research, this study addresses both aspects within a single context, emphasizing the need for a sentiment-based predictive approach to the buyers’ journey. This review explores how predictive and sentiment models enhance customer experience, inform business decisions, and optimize marketing processes. This study contributes to the literature by identifying areas of improvement in predictive modelling and emphasizes the role of a sentiment-based approach in Marketing 5.0. The sentiment-based model assists businesses in understanding customer preferences, offering personalized products, and enabling customers to receive relevant advertisements during their purchase journey. The paper’s structure covers the evolution of traditional marketing to digital marketing, AI’s role in digital marketing, predictive modelling in marketing, and the significance of analyzing customer sentiments in their reviews. The Prisma-P methodology, research questions, limitations, and suggestions for future work provide a comprehensive overview of the scope and contributions of this review.



Citations: 0

Advancing cybersecurity: a comprehensive review of AI-driven detection techniques

IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Big Data

Pub Date : 2024-08-04 DOI: 10.1186/s40537-024-00957-y

Aya H. Salem, Safaa M. Azzam, O. E. Emam, Amr A. Abohany

As the number and sophistication of cyber-attacks keep increasing rapidly, it's more important than ever to have good ways to detect and prevent them. Recognizing cyber threats quickly and accurately is crucial because they can cause severe damage to individuals and businesses. This paper takes a close look at how we can use artificial intelligence (AI), including machine learning (ML) and deep learning (DL), alongside metaheuristic algorithms to detect cyber-attacks better. We've thoroughly examined over sixty recent studies to measure how effective these AI tools are at identifying and fighting a wide range of cyber threats. Our research includes a diverse array of cyber-attacks such as malware attacks, network intrusions, spam, and others, showing that ML and DL methods, together with metaheuristic algorithms, significantly improve how well we can find and respond to cyber threats. We compare these AI methods to find out what they're good at and where they could improve, especially as we face new and changing cyber-attacks. This paper presents a straightforward framework for assessing AI methods in cyber threat detection. Given the increasing complexity of cyber threats, enhancing AI methods and regularly ensuring strong protection are critical. We evaluate the effectiveness and the limitations of currently proposed ML and DL models, in addition to the metaheuristic algorithms. Recognizing these limitations is vital for guiding future enhancements. We're pushing for smart and flexible solutions that can adapt to new challenges. The findings from our research suggest that the future of protecting against cyber-attacks will rely on continuously updating AI methods to stay ahead of hackers' latest tricks.



Citations: 0