Pub Date: 2025-11-11 | DOI: 10.1016/j.bdr.2025.100572
Jiachen Xie , Jiwei Qin , Xizhong Qin , Daishun Cui , Qiang Li , Dezhi Sun
Carbon dioxide (CO2) emissions play a crucial role in driving global climate change. Precise and reliable predictions of CO2 emission trends are instrumental in fostering sustainable development and realizing dual-carbon goals. Owing to complex human activities, economic development, and meteorological factors, accurate long-term prediction of CO2 emission time series faces numerous challenges, such as long-term temporal dependencies and complicated non-linear correlations. To address these challenges, we propose a long-term CO2 emissions prediction model called CarbonLinear. CarbonLinear is based on a multilayer perceptron, whose flexible network structure and deep connectivity allow it to better capture non-linear relationships in long-term CO2 emission time series. It employs an adaptive global-local multiscale integrated modeling architecture to mitigate the data distribution shift problem. In addition, CarbonLinear introduces a sequence segmentation module, which allows it to model local features in long-term CO2 emission time series and improves the computational efficiency of the model. Experimental results show that CarbonLinear performs well on CO2 emissions datasets from multiple regions, achieving significant improvements over other models, and provides scientists and policymakers with a more accurate and reliable tool for CO2 emissions prediction.
{"title":"Research on adaptive long-term time series carbon dioxide emission prediction model based on improved multilayer perceptron","authors":"Jiachen Xie , Jiwei Qin , Xizhong Qin , Daishun Cui , Qiang Li , Dezhi Sun","doi":"10.1016/j.bdr.2025.100572","DOIUrl":"10.1016/j.bdr.2025.100572","url":null,"abstract":"<div><div>Carbon dioxide (CO<sub>2</sub>) emissions play a crucial role in driving global climate change. Precise and reliable predictions of CO<sub>2</sub> emission trends are instrumental in fostering sustainable development and realizing dual-carbon goals. Due to complex human activities, economic development, and meteorological factors, accurate long-term time series prediction of CO<sub>2</sub> emissions encounters numerous challenges, such as the long-term temporal dependencies and complicated non-linear correlation in long-term time series CO<sub>2</sub> emissions. To address these challenges, we propose a long-term time series CO<sub>2</sub> emissions prediction model called CarbonLinear. The proposed CarbonLinear is based on Multilayer Perceptron that can better capture non-linear relationships in long-term time series CO<sub>2</sub> emissions through a flexible network structure and deep connectivity. The proposed CarbonLinear employs an adaptive global-local multiscale integrated modeling architecture to mitigate the data distribution shift problem adaptively. In addition, the proposed CarbonLinear introduces a sequence segmentation module, which allows CarbonLinear to model local features in a long-term time series CO<sub>2</sub> emissions and improves the computational efficiency of the model. Experimental results show that the proposed CarbonLinear performs well on CO<sub>2</sub> emissions datasets from multiple regions, significantly improving over other models. The proposed CarbonLinear provides scientists and policymakers with a more accurate and reliable tool for CO<sub>2</sub> emissions prediction.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"42 ","pages":"Article 100572"},"PeriodicalIF":4.2,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-07 | DOI: 10.1016/j.bdr.2025.100573
Lucas Correia , Jan-Christoph Goos , Thomas Bäck , Anna V. Kononova
Benchmarking anomaly detection approaches for multivariate time series is a challenging task due to a lack of high-quality datasets. Current publicly available datasets are too small, insufficiently diverse, and feature only trivial anomalies, which hinders measurable progress in this research area. We propose a solution: a diverse, extensive, and non-trivial dataset generated via state-of-the-art simulation tools that reflects realistic behaviour of an automotive powertrain, including its multivariate, dynamic and variable-state properties. Additionally, our dataset represents a discrete-sequence problem, which remains unaddressed by previously proposed solutions in the literature. To cater for both unsupervised and semi-supervised anomaly detection settings, as well as time series generation and forecasting, we make different versions of the dataset available, with training and test subsets offered in contaminated and clean versions depending on the task. We also provide baseline results from a selection of approaches based on deterministic and variational autoencoders, as well as a non-parametric approach. As expected, the baseline experiments show that approaches trained on the semi-supervised version of the dataset outperform their unsupervised counterparts, highlighting a need for approaches that are more robust to contaminated training data. Furthermore, the results show that the chosen threshold can strongly influence detection performance, so more work is needed on methods for finding a suitable threshold without labelled data.
{"title":"PATH: A discrete-sequence dataset for evaluating online unsupervised anomaly detection approaches for multivariate time series","authors":"Lucas Correia , Jan-Christoph Goos , Thomas Bäck , Anna V. Kononova","doi":"10.1016/j.bdr.2025.100573","DOIUrl":"10.1016/j.bdr.2025.100573","url":null,"abstract":"<div><div>Benchmarking anomaly detection approaches for multivariate time series is a challenging task due to a lack of high-quality datasets. Current publicly available datasets are too small, not diverse and feature trivial anomalies, which hinders measurable progress in this research area. We propose a solution: a diverse, extensive, and non-trivial dataset generated via state-of-the-art simulation tools that reflects realistic behaviour of an automotive powertrain, including its multivariate, dynamic and variable-state properties. Additionally, our dataset represents a discrete-sequence problem, which remains unaddressed by previously-proposed solutions in literature. To cater for both unsupervised and semi-supervised anomaly detection settings, as well as time series generation and forecasting, we make different versions of the dataset available, where training and test subsets are offered in contaminated and clean versions, depending on the task. We also provide baseline results from a selection of approaches based on deterministic and variational autoencoders, as well as a non-parametric approach. As expected, the baseline experimentation shows that the approaches trained on the semi-supervised version of the dataset outperform their unsupervised counterparts, highlighting a need for approaches more robust to contaminated training data. Furthermore, results show that the threshold used can have a large influence on detection performance, hence more work needs to be invested in methods to find a suitable threshold without the need for labelled data.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"42 ","pages":"Article 100573"},"PeriodicalIF":4.2,"publicationDate":"2025-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-31 | DOI: 10.1016/j.bdr.2025.100571
S.M. Mehzabeen, R. Gayathri
Effective pest management often involves the use of appropriate pesticides, and early identification of pests is essential for protecting crops. Deep learning-based approaches for timely and accurate pest identification have gained traction as effective solutions to agricultural challenges, including the detection of plant diseases and pests. Thus, a pest recognition model based on an effective deep learning method is developed to improve crop productivity. The model is implemented by collecting the required input images, which are fed into the developed Region Vision Transformer-Yolov8 (RegViT-Yolov8) model for pest detection. The RegionViT features of all bounding boxes are then extracted from the detection output and concatenated. Principal Component Analysis (PCA)-based feature reduction is subsequently applied to the concatenated features. Following feature reduction, pest classification is performed using the developed Adaptive Residual Bidirectional Gated Recurrent Unit (AR-BiGRU). Classification accuracy is further enhanced by optimizing the system parameters with an Advanced Random Variable-based Preschool Education Optimization Algorithm (ARV-PEOA). The effectiveness of this framework is validated with diverse measures, and the results are compared with existing techniques to showcase its efficiency. In terms of accuracy, the proposed model attains 96.7% in the analysis with 500 hidden neurons.
{"title":"Effective adaptive res-BiGRU network for pest classification performance based on regionViT-yolov8-aided pest detection technique","authors":"S.M. Mehzabeen, R Gayathri","doi":"10.1016/j.bdr.2025.100571","DOIUrl":"10.1016/j.bdr.2025.100571","url":null,"abstract":"<div><div>Effective pest management often involves the use of appropriate pesticides, and early identification of pests is essential for protecting crops. Timely and accurate identification of pests using deep learning-based approaches have gained traction as effective solutions for addressing agricultural challenges, including the detection of plant diseases and pests. Thus, a pest recognition model is developed to improve the productivity of the crops based on effective deep learning method. This model is implemented by accumulating the required input images, which are then fed into the developed Region Vision Transformer-Yolov8 (RegViT- Yolov8) model for pest detection. Then, the RegionViT features of all Bounded Boxes are extracted from the detected outcome and concatenated together. Further, Principal Component Analysis (PCA) based feature reduction is executed by considering the concatenated features as input. Following feature reduction, pest classification is performed by using the developed Adaptive Residual Bidirectional Gated Recurrent Unit (AR-BiGRU). Moreover, the classification accuracy is enhanced by optimizing the system parameters using an Advanced Random Variable-based Preschool Education Optimization Algorithm (ARV-PEOA). Thus, the effectiveness of this framework is validated with diverse measures and the attained outcome is compared with the existing techniques to showcase its efficiency. While considering the accuracy measure, the proposed model has attained 96.7 % accurate result on the analysis based on 500 hidden neurons.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"42 ","pages":"Article 100571"},"PeriodicalIF":4.2,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-30 | DOI: 10.1016/j.bdr.2025.100570
Mei Wang , Hai-Ning Liang , Yu Liu , Chengtao Ji , Lingyun Yu
In this study, we aim to explore an interactive system that integrates visual metaphors, AI-powered essay scoring techniques, and tangible feedback to enhance students' English language learning experience. Over the past decade, AI has made significant strides across various domains, including education. A prominent example is the integration of AI-driven language learning tools featuring Automated Essay Scoring (AES) systems. Traditionally, AES relied on predefined criteria and provided scores in simple text formats, which often lack depth and fail to engage students in understanding their progress or areas for improvement. To address these limitations and enhance learnability, we propose a system that combines AI-powered AES with a visualization approach. Our system includes three main components: an AI-driven scoring algorithm, a visualization interface translating scoring outcomes into visual metaphors, and tangible postcards for presenting scores. To evaluate the use of our visualization system and tangible feedback in practice, we conducted domain expert interviews and a three-stage user study. The results indicate that the progressive visual feedback and tangible postcards increased practice frequency and significantly boosted study motivation. Tangible visual feedback showed positive effects on fostering progressive learning. Through this study, we recognized the potential of combining AI, visual metaphors, and tangible feedback in English education to encourage continuous and active learning.
{"title":"Tangible progress: Employing visual metaphors and physical interfaces in AI-based English language learning","authors":"Mei Wang , Hai-Ning Liang , Yu Liu , Chengtao Ji , Lingyun Yu","doi":"10.1016/j.bdr.2025.100570","DOIUrl":"10.1016/j.bdr.2025.100570","url":null,"abstract":"<div><div>In this study, we aim to explore an interactive system that integrates visual metaphors, AI-powered essay scoring techniques, and tangible feedback to enhance students' English language learning experience. Over the past decade, AI has made significant strides across various domains, including education. A prominent example of this is the integration of AI-driven language learning tools featuring Automated Essay Scoring (AES) systems. Traditionally, AES relied on predefined criteria and provided scores in simple text formats, which often lack depth and fail to engage students in understanding their progress or areas for improvement. To address these limitations and enhance learnability, we propose a system that harnesses AI-powered AES with a visualization approach. Our system includes three main components: an AI-driven scoring algorithm, a visualization interface translating scoring outcomes into visual metaphors, and tangible postcards for presenting scores. To evaluate the usage of our visualization system and tangible-formatted feedback in practice, we conducted domain expert interviews and a three-stage user study. The results indicate that the progressive visual feedback and tangible postcards increased practice frequency and significantly boosted study motivation. Tangible visual feedback showed positive effects on fostering progressive learning. Through this study, we recognized the potential of combining AI, visual metaphors, and tangible feedback in English education to encourage continuous and active learning.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"42 ","pages":"Article 100570"},"PeriodicalIF":4.2,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145467311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-28 | DOI: 10.1016/j.bdr.2025.100569
G.Y. Chandan , Prity Kumari
This study investigates price forecasting models for cotton in Gujarat, India, using daily modal prices and arrival data sourced from Agmarknet spanning April 2002 to April 2023. Given the volatile and nonlinear nature of agricultural prices, this research integrates exogenous variables through statistical and advanced deep learning models to enhance predictive accuracy. The models tested include the Autoregressive Integrated Moving Average with Exogenous variables (ARIMAX), Artificial Neural Networks (ANN), Recurrent Neural Networks (RNN), Gated Recurrent Units (GRU), Long Short-Term Memory (LSTM) and Stacked LSTM. Results reveal that the Stacked LSTM model outperforms traditional statistical and basic neural network models, achieving the lowest values of error metrics such as Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE) and Symmetric Mean Absolute Percentage Error (SMAPE). With a 365-day-ahead forecast horizon, the Stacked LSTM model yielded an error of 9.30% during the pre-sowing season (May-June 2023) and 13.75% in the harvesting season (October-November 2023). This precision in capturing seasonal price fluctuations can be attributed to the integration of relevant exogenous variables, which enhance the model's ability to account for external market influences affecting cotton prices in Gujarat.
{"title":"Exogenous variable driven cotton prices prediction: comparison of statistical model with sequence based deep learning models","authors":"G.Y. Chandan , Prity Kumari","doi":"10.1016/j.bdr.2025.100569","DOIUrl":"10.1016/j.bdr.2025.100569","url":null,"abstract":"<div><div>This study investigates price forecasting model for cotton in Gujarat, India, using daily modal prices and arrival data sourced from Agmarknet spanning April 2002 to April 2023. Given the volatile and nonlinear nature of agricultural prices, this research integrates exogenous variables through statistical and advanced deep learning models to enhance predictive accuracy. The models tested include the Autoregressive Integrated Moving Average with Exogenous variables (ARIMAX), Artificial Neural Networks (ANN), Recurrent Neural Networks (RNN), Gated Recurrent Units (GRU), Long Short-Term Memory (LSTM) and Stacked LSTM. Results reveal that Stacked LSTM model outperforms traditional statistical and basic neural network models, achieving the lowest values in accuracy metrics like Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE) and Symmetric Mean Absolute Percentage Error (SMAPE). With 365 days ahead forecast horizon, Stacked LSTM model yielded an error of 9.30% during pre-sowing season (May-June 2023) and 13.75% in harvesting season (October-November 2023). This precision in capturing seasonal price fluctuations can be attributed to the integration of relevant exogenous variables, which enhance the model’s ability to account for external market influences affecting cotton prices in Gujarat.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"42 ","pages":"Article 100569"},"PeriodicalIF":4.2,"publicationDate":"2025-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145418507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-17 | DOI: 10.1016/j.bdr.2025.100568
Haoran Gong, Lei Lei, Shan Ma, Chunyu Qiu
Meteorological data is closely related to everyone's daily life, and accurate weather forecasting is crucial for many socio-economic activities. However, as a typical spatio-temporal data type, the complex temporal nonlinearity and spatial dependencies in meteorological data greatly increase the difficulty of forecasting. This paper proposes a neural network model, STED (Spatio-Temporal Data Encoder-Decoder), based on an encoder-decoder architecture, which effectively handles the temporal dynamics of long time series and high-precision spatial dependencies. STED consists of three modules: a spatial encoder-decoder, a temporal encoder-decoder, and a predictor. The spatial encoder-decoder extracts spatial features, the temporal encoder-decoder extracts temporal features, and the predictor is used for forecasting. Experimental results show that STED performs similarly to current state-of-the-art (SoTA) spatio-temporal forecasting models in short-term temperature prediction tasks, but significantly outperforms other models in medium- and long-term temperature prediction tasks. Additionally, this paper compares different spatial encoder-decoders for forecasting tasks with varying node scales. The experimental results demonstrate that, for small-scale node tasks, the spatial encoder-decoder based on multilayer perceptrons achieves good accuracy and efficiency. In contrast, for large-scale node tasks, the spatial encoder-decoder based on convolutional neural networks exhibits superior performance.
{"title":"STED: An encoder-decoder architecture for long-term spatio-temporal weather forecasting","authors":"Haoran Gong, Lei Lei, Shan Ma, Chunyu Qiu","doi":"10.1016/j.bdr.2025.100568","DOIUrl":"10.1016/j.bdr.2025.100568","url":null,"abstract":"<div><div>Meteorological data is closely related to everyone's daily life, and accurate weather forecasting is crucial for many socio-economic activities. However, as a typical spatio-temporal data type, the complex temporal nonlinearity and spatial dependencies in meteorological data greatly increase the difficulty of forecasting. This paper proposes a neural network model, STED (Spatio-Temporal Data Encoder-Decoder), based on an encoder-decoder architecture, which effectively handles the temporal dynamics of long time series and high-precision spatial dependencies. STED consists of three modules: a spatial encoder-decoder, a temporal encoder-decoder, and a predictor. The spatial encoder-decoder extracts spatial features, the temporal encoder-decoder extracts temporal features, and the predictor is used for forecasting. Experimental results show that STED performs similarly to current state-of-the-art (SoTA) spatio-temporal forecasting models in short-term temperature prediction tasks, but significantly outperforms other models in medium- and long-term temperature prediction tasks. Additionally, this paper compares different spatial encoder-decoders for forecasting tasks with varying node scales. The experimental results demonstrate that, for small-scale node tasks, the spatial encoder-decoder based on multilayer perceptrons achieves good accuracy and efficiency. In contrast, for large-scale node tasks, the spatial encoder-decoder based on convolutional neural networks exhibits superior performance.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"42 ","pages":"Article 100568"},"PeriodicalIF":4.2,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145364825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-09 | DOI: 10.1016/j.bdr.2025.100567
Yongpeng Yang , Zhenzhen Yang
In intelligent cities, traffic forecasting plays a significant role in intelligent transportation systems. Many methods combining spectral graph neural networks (GNNs) and self-attention have been proposed. However, they still have limitations for traffic forecasting: 1) the polynomial basis of traditional spectral GNNs is fixed, which limits their ability to learn the spatial dependency of traffic data; 2) some GNNs ignore the dynamic dependency of traffic data; and 3) traditional self-attention suffers from limited perception of long-term information, time delay, and global information. These shortcomings limit the ability of existing models to capture the spatial-temporal dependency, dynamics, and heterogeneity of traffic data. From this perspective, we propose an adaptive spectral GNN and frequency enhanced self-attention (ASGFES) for traffic forecasting, which can effectively capture the spatial-temporal dependency, dynamics, and heterogeneity of traffic data. Specifically, we first introduce an adaptive spectral graph neural network (ASGNN) that captures spatial dependency effectively via an adaptive polynomial basis. In addition, two dynamic long- and short-range attentive graphs are fed into the ASGNN to emphasize dynamics at both long and short ranges. Secondly, we introduce a normalized self-attention with damped exponential moving average (NSADEMA). The normalized self-attention (NSA) provides the expressivity needed to learn all-pair interactions without extra operations such as positional encodings or multi-head attention, and it captures the temporal dependency and heterogeneity of traffic data well. The DEMA incorporated into the NSA strengthens the inductive bias of traffic data in the time domain and makes the model aware of time delays. Thirdly, a linear frequency learner with time-series decomposition (LFLTD) is developed to further enhance the ability to capture temporal dependency and heterogeneity. Time-series decomposition (TSD) facilitates the analysis and forecasting of complex time series by capturing hidden components such as trend and seasonal components, while the linear frequency learner (LFL) learns global dependencies and concentrates on the important frequency components, where signal energy is compact. Finally, extensive experiments on several public traffic datasets demonstrate that the proposed ASGFES achieves better performance than other traffic forecasting methods.
{"title":"Adaptive spectral GNN and frequency enhanced self-attention for traffic forecasting","authors":"Yongpeng Yang , Zhenzhen Yang","doi":"10.1016/j.bdr.2025.100567","DOIUrl":"10.1016/j.bdr.2025.100567","url":null,"abstract":"<div><div>In intelligent city, traffic forecasting has played a significant role in intelligent transportation system. Nowadays, many methods, which combine spectral graph neural network and self-attention, are proposed. However, they still have some limitations for traffic forecasting: 1) The polynomial basis of traditional spectral graph neural networks (GNN) is fixed, which limits their ability to learn spatial dependency of traffic data. 2) Some GNNs ignore the dynamic dependency of traffic data. 3) Traditional self-attention suffers from limited perception for long-term information, time delay, and global information. These defaults pose big challenge for traffic forecasting via limiting their ability of capturing spatial-temporal dependency, dynamic and heterogeneous nature in traffic data. From this perspective, we propose an adaptive spectral GNN and frequency enhanced self-attention (ASGFES) for traffic forecasting, which can effectively capture the spatial-temporal dependency, dynamic and heterogeneous nature in traffic data. Specifically, we first introduce an adaptive spectral graph neural network (ASGNN) for effectively capturing the spatial dependency via conducting adaptive polynomial basis. In addition, two dynamic long and short range attentive graphs are fed into the ASGNN for emphasizing the dynamicity in view of long and short range. Secondly, we introduce a normalized self-attention with damped exponential moving average (NSADEMA). Specifically, the normalized self-attention (NSA) can capture the necessary expressivity to learn all-pair interactions without the need for some extra operation such as positional encodings, multi-head operations, and so on. It can well obtain the temporal dependency and heterogeneity of traffic data. In addition, the DEMA, which is equipped into NSA, can enhance the perception for the inductive bias of traffic data in time domain. It can be aware of the time delay of traffic data. Thirdly, linear frequency learner with time-series decomposition (LFLTD) are developed for enhancing the ability of capturing the temporal dependency and heterogeneity. Specifically, time-series decomposition (TSD) facilitates the analysis and forecasting of complex time via capturing various hidden components such as the trend and seasonal components. Meanwhile, linear frequency learner (LFL) can learn global dependencies and concentrating on important part of frequency components with compact signal energy. At last, many experiments are performed on several public traffic datasets and demonstrate the proposed ASGFES can achieve better performance than other traffic forecasting methods.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"42 ","pages":"Article 100567"},"PeriodicalIF":4.2,"publicationDate":"2025-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145271154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-28 | DOI: 10.1016/j.bdr.2025.100556
Sobia Tariq Javed , Kashif Zafar , Irfan Younas
The rapid advancement of technology has led to the generation of big data. This vast and diverse data can uncover valuable patterns and yield promising results when effectively mined, processed, and analyzed. However, it also introduces the "curse of dimensionality," which can negatively impact the performance of machine learning models. Feature Selection (FS) is a data preprocessing technique aimed at identifying the optimal feature set to enhance model efficiency and reduce processing time. Numerous metaheuristic wrapper-based FS techniques have been explored in the literature. However, a significant drawback of many of these algorithms is their dependence on centralized learning, where the global best solution drives the search direction. This centralized approach is risky, as any error by the global best can hinder the exploration and exploitation of other potential areas, leading to inaccuracies in discovering the true global optimum. In this paper, a binary variant of a novel decentralized metaheuristic, the Kids Learning Optimization Algorithm (KLO), called the Binary Kids Learning Optimization Algorithm (BKLO), is proposed for optimal feature selection for classification in wrapper mode. The continuous solutions of KLO are converted to binary space using a transfer function, and a comparison is provided between two transfer functions: the hyperbolic tangent (V-shaped) and the sigmoidal (S-shaped). BKLO is compared with seven state-of-the-art algorithms. The performance of the algorithms is evaluated and compared using several assessment indicators over fifteen benchmark datasets with a wide range of dimensions (small, medium, and large) from the University of California Irvine (UCI) repository and Arizona State University. The superiority of BKLO in reducing the number of features while increasing classification accuracy over the other competing algorithms is demonstrated through the experiments and Friedman's Mean Rank (FMR) statistical tests.
{"title":"A decentralized metaheuristic approach to feature selection inspired by social interactions within a societal framework, for handling datasets of diverse sizes","authors":"Sobia Tariq Javed , Kashif Zafar , Irfan Younas","doi":"10.1016/j.bdr.2025.100556","DOIUrl":"10.1016/j.bdr.2025.100556","url":null,"abstract":"<div><div>The rapid advancement of technology has led to the generation of big data. This vast and diverse data can uncover valuable patterns and yield promising results when effectively mined, processed, and analyzed. However, it also introduces the “curse of dimensionality,” which can negatively impact the performance of machine learning models. Feature Selection (FS) is a data preprocessing technique aimed at identifying the optimal feature set to enhance model efficiency and reduce processing time. Numerous metaheuristic wrapper-based FS techniques have been explored in the literature. However, a significant drawback of many of these algorithms is their dependence on centralized learning, where the global best solution drives the search direction. This centralized approach is risky, as any error by the global best can hinder the exploration and exploitation of other potential areas, leading to inaccuracies in discovering the true global optimum. In this paper, the binary variant of a novel decentralized metaheuristic Kids Learning Optimization Algorithm (KLO) called <strong>Binary Kids Learning Optimization Algorithm (BKLO)</strong> is proposed for optimal feature selection for classification purposes in wrapper mode. The continuous solutions of KLO are converted to binary space by using the transfer function. A comparison is provided between the two transfer functions: hyperbolic tan (V-shaped) and the Sigmoidal (S-shaped) transfer functions. BKLO is compared with seven state-of-the-art algorithms. The performance of algorithms is evaluated and compared using several assessment indicators over fifteen benchmark datasets with a wide range of dimensions (small, medium, and large) from the University of California Irvine (UCI) repository and Arizona State University. The superiority of BKLO in reducing the number of features with increased classification accuracy over the other competing algorithms is demonstrated through the experiments and Friedman's Mean Rank (FMR) statistical tests.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"41 ","pages":"Article 100556"},"PeriodicalIF":4.2,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144903932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-20 | DOI: 10.1016/j.bdr.2025.100554
Keren Li , Wenqiang Zhang , Dandan Xiao , Peng Hou , Shuai Yan , Yang Wang , Xuerui Mao
To address the storage challenges stemming from large volumes of heterogeneous data in wind farms, we propose a data compression technique based on tensor train decomposition (TTD). Initially, we establish a tensor-based processing model to standardize the heterogeneous data originating from wind farms, which includes both structured SCADA (supervisory control and data acquisition) data and unstructured video and picture data. Subsequently, we introduce a TTD-based method designed to compress the heterogeneous data generated in wind farms while preserving the inherent spatial eigenstructure of the data. Finally, we validate the efficacy of the proposed method in alleviating data storage challenges by utilizing authentic wind farm datasets. Comparative analysis reveals that the TTD-based method outperforms previously proposed compression techniques, specifically the canonical polyadic (CP) and Tucker methods.
{"title":"Compression of big data collected in wind farm based on tensor train decomposition","authors":"Keren Li , Wenqiang Zhang , Dandan Xiao , Peng Hou , Shuai Yan , Yang Wang , Xuerui Mao","doi":"10.1016/j.bdr.2025.100554","DOIUrl":"10.1016/j.bdr.2025.100554","url":null,"abstract":"<div><div>To address the storage challenges stemming from large volumes of heterogeneous data in wind farms, we propose a data compression technique based on tensor train decomposition (TTD). Initially, we establish a tensor-based processing model to standardize the heterogeneous data originating from wind farms, which includes both structured SCADA (supervisory control and data acquisition) data and unstructured video and picture data. Subsequently, we introduce a TTD-based method designed to compress the heterogeneous data generated in wind farms while preserving the inherent spatial eigenstructure of the data. Finally, we validate the efficacy of the proposed method in alleviating data storage challenges by utilizing authentic wind farm datasets. Comparative analysis reveals that the TTD-based method outperforms previously proposed compression techniques, specifically the canonical polyadic (CP) and Tucker methods.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"41 ","pages":"Article 100554"},"PeriodicalIF":4.2,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144886090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-19 | DOI: 10.1016/j.bdr.2025.100555
Hesamodin Mohammadian, Griffin Higgins, Samuel Ansong, Roozbeh Razavi-Far, Ali A. Ghorbani
Recently, Control Flow Graphs and Function Call Graphs have gained attention in malware detection due to their ability to represent the complex structural and functional behavior of programs. To better utilize these representations in malware detection and improve detection performance, they have been paired with Graph Neural Networks (GNNs). However, the sheer size and complexity of these graph representations pose a significant challenge for researchers. At the same time, the simple binary classification provided by GNN models is insufficient for malware analysts. To address these challenges, this paper integrates novel graph reduction techniques and GNN explainability into a malware detection framework to enhance both efficiency and interpretability. Through our extensive evaluation, we demonstrate that the proposed graph reduction technique significantly reduces the size and complexity of the input graphs while maintaining detection performance. Furthermore, the important subgraphs extracted using GNNExplainer provide better insight into the model's decisions and help security experts with their further analysis.
{"title":"Explainable malware detection through integrated graph reduction and learning techniques","authors":"Hesamodin Mohammadian, Griffin Higgins, Samuel Ansong, Roozbeh Razavi-Far, Ali A. Ghorbani","doi":"10.1016/j.bdr.2025.100555","DOIUrl":"10.1016/j.bdr.2025.100555","url":null,"abstract":"<div><div>Recently, Control Flow Graphs and Function Call Graphs have gain attention in malware detection task due to their ability in representation the complex structural and functional behavior of programs. To better utilize these representations in malware detection and improve the detection performance, they have been paired with Graph Neural Networks (GNNs). However, the sheer size and complexity of these graph representation poses a significant challenge for researchers. At the same time, a simple binary classification provided by the GNN models is insufficient for malware analysts. To address these challenges, this paper integrates novel graph reduction techniques and GNN explainability in to a malware detection framework to enhance both efficiency and interpretability. Through our extensive evolution, we demonstrate that the proposed graph reduction technique significantly reduces the size and complexity of the input graphs, while maintaining the detection performance. Furthermore, the extracted important subgraphs using the GNNExplainer, provide better insights about the model's decision and help security experts with their further analysis.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"41 ","pages":"Article 100555"},"PeriodicalIF":4.2,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144863267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}