Pub Date: 2025-10-30 | DOI: 10.1016/j.bdr.2025.100570
Mei Wang, Hai-Ning Liang, Yu Liu, Chengtao Ji, Lingyun Yu
In this study, we aim to explore an interactive system that integrates visual metaphors, AI-powered essay scoring techniques, and tangible feedback to enhance students' English language learning experience. Over the past decade, AI has made significant strides across various domains, including education. A prominent example is the integration of AI-driven language learning tools featuring Automated Essay Scoring (AES) systems. Traditionally, AES has relied on predefined criteria and provided scores in simple text formats, which often lack depth and fail to engage students in understanding their progress or areas for improvement. To address these limitations and enhance learnability, we propose a system that combines AI-powered AES with a visualization approach. Our system includes three main components: an AI-driven scoring algorithm, a visualization interface translating scoring outcomes into visual metaphors, and tangible postcards for presenting scores. To evaluate our visualization system and tangible feedback in practice, we conducted domain expert interviews and a three-stage user study. The results indicate that the progressive visual feedback and tangible postcards increased practice frequency and significantly boosted study motivation. Tangible visual feedback showed positive effects on fostering progressive learning. Through this study, we recognized the potential of combining AI, visual metaphors, and tangible feedback in English education to encourage continuous and active learning.
{"title":"Tangible progress: Employing visual metaphors and physical interfaces in AI-based English language learning","authors":"Mei Wang , Hai-Ning Liang , Yu Liu , Chengtao Ji , Lingyun Yu","doi":"10.1016/j.bdr.2025.100570","DOIUrl":"10.1016/j.bdr.2025.100570","url":null,"abstract":"<div><div>In this study, we aim to explore an interactive system that integrates visual metaphors, AI-powered essay scoring techniques, and tangible feedback to enhance students' English language learning experience. Over the past decade, AI has made significant strides across various domains, including education. A prominent example of this is the integration of AI-driven language learning tools featuring Automated Essay Scoring (AES) systems. Traditionally, AES relied on predefined criteria and provided scores in simple text formats, which often lack depth and fail to engage students in understanding their progress or areas for improvement. To address these limitations and enhance learnability, we propose a system that harnesses AI-powered AES with a visualization approach. Our system includes three main components: an AI-driven scoring algorithm, a visualization interface translating scoring outcomes into visual metaphors, and tangible postcards for presenting scores. To evaluate the usage of our visualization system and tangible-formatted feedback in practice, we conducted domain expert interviews and a three-stage user study. The results indicate that the progressive visual feedback and tangible postcards increased practice frequency and significantly boosted study motivation. Tangible visual feedback showed positive effects on fostering progressive learning. Through this study, we recognized the potential of combining AI, visual metaphors, and tangible feedback in English education to encourage continuous and active learning.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"42 ","pages":"Article 100570"},"PeriodicalIF":4.2,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145467311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-28 | DOI: 10.1016/j.bdr.2025.100569
G.Y. Chandan, Prity Kumari
This study investigates price forecasting models for cotton in Gujarat, India, using daily modal prices and arrival data sourced from Agmarknet spanning April 2002 to April 2023. Given the volatile and nonlinear nature of agricultural prices, this research integrates exogenous variables through statistical and advanced deep learning models to enhance predictive accuracy. The models tested include the Autoregressive Integrated Moving Average with Exogenous variables (ARIMAX), Artificial Neural Networks (ANN), Recurrent Neural Networks (RNN), Gated Recurrent Units (GRU), Long Short-Term Memory (LSTM), and Stacked LSTM. Results reveal that the Stacked LSTM model outperforms traditional statistical and basic neural network models, achieving the lowest values on accuracy metrics such as Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and Symmetric Mean Absolute Percentage Error (SMAPE). With a 365-day-ahead forecast horizon, the Stacked LSTM model yielded an error of 9.30% during the pre-sowing season (May-June 2023) and 13.75% in the harvesting season (October-November 2023). This precision in capturing seasonal price fluctuations can be attributed to the integration of relevant exogenous variables, which enhance the model's ability to account for external market influences affecting cotton prices in Gujarat.
{"title":"Exogenous variable driven cotton prices prediction: comparison of statistical model with sequence based deep learning models","authors":"G.Y. Chandan , Prity Kumari","doi":"10.1016/j.bdr.2025.100569","DOIUrl":"10.1016/j.bdr.2025.100569","url":null,"abstract":"<div><div>This study investigates price forecasting model for cotton in Gujarat, India, using daily modal prices and arrival data sourced from Agmarknet spanning April 2002 to April 2023. Given the volatile and nonlinear nature of agricultural prices, this research integrates exogenous variables through statistical and advanced deep learning models to enhance predictive accuracy. The models tested include the Autoregressive Integrated Moving Average with Exogenous variables (ARIMAX), Artificial Neural Networks (ANN), Recurrent Neural Networks (RNN), Gated Recurrent Units (GRU), Long Short-Term Memory (LSTM) and Stacked LSTM. Results reveal that Stacked LSTM model outperforms traditional statistical and basic neural network models, achieving the lowest values in accuracy metrics like Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE) and Symmetric Mean Absolute Percentage Error (SMAPE). With 365 days ahead forecast horizon, Stacked LSTM model yielded an error of 9.30% during pre-sowing season (May-June 2023) and 13.75% in harvesting season (October-November 2023). This precision in capturing seasonal price fluctuations can be attributed to the integration of relevant exogenous variables, which enhance the model’s ability to account for external market influences affecting cotton prices in Gujarat.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"42 ","pages":"Article 100569"},"PeriodicalIF":4.2,"publicationDate":"2025-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145418507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-17 | DOI: 10.1016/j.bdr.2025.100568
Haoran Gong, Lei Lei, Shan Ma, Chunyu Qiu
Meteorological data is closely related to everyone's daily life, and accurate weather forecasting is crucial for many socio-economic activities. However, as a typical type of spatio-temporal data, meteorological data exhibits complex temporal nonlinearity and spatial dependencies that greatly increase the difficulty of forecasting. This paper proposes a neural network model, STED (Spatio-Temporal Data Encoder-Decoder), based on an encoder-decoder architecture, which effectively handles the temporal dynamics of long time series and high-precision spatial dependencies. STED consists of three modules: a spatial encoder-decoder, a temporal encoder-decoder, and a predictor. The spatial encoder-decoder extracts spatial features, the temporal encoder-decoder extracts temporal features, and the predictor is used for forecasting. Experimental results show that STED performs similarly to current state-of-the-art (SoTA) spatio-temporal forecasting models in short-term temperature prediction tasks, but significantly outperforms other models in medium- and long-term temperature prediction tasks. Additionally, this paper compares different spatial encoder-decoders for forecasting tasks with varying node scales. The experimental results demonstrate that, for small-scale node tasks, a spatial encoder-decoder based on multilayer perceptrons achieves good accuracy and efficiency. In contrast, for large-scale node tasks, a spatial encoder-decoder based on convolutional neural networks exhibits superior performance.
{"title":"STED: An encoder-decoder architecture for long-term spatio-temporal weather forecasting","authors":"Haoran Gong, Lei Lei, Shan Ma, Chunyu Qiu","doi":"10.1016/j.bdr.2025.100568","DOIUrl":"10.1016/j.bdr.2025.100568","url":null,"abstract":"<div><div>Meteorological data is closely related to everyone's daily life, and accurate weather forecasting is crucial for many socio-economic activities. However, as a typical spatio-temporal data type, the complex temporal nonlinearity and spatial dependencies in meteorological data greatly increase the difficulty of forecasting. This paper proposes a neural network model, STED (Spatio-Temporal Data Encoder-Decoder), based on an encoder-decoder architecture, which effectively handles the temporal dynamics of long time series and high-precision spatial dependencies. STED consists of three modules: a spatial encoder-decoder, a temporal encoder-decoder, and a predictor. The spatial encoder-decoder extracts spatial features, the temporal encoder-decoder extracts temporal features, and the predictor is used for forecasting. Experimental results show that STED performs similarly to current state-of-the-art (SoTA) spatio-temporal forecasting models in short-term temperature prediction tasks, but significantly outperforms other models in medium- and long-term temperature prediction tasks. Additionally, this paper compares different spatial encoder-decoders for forecasting tasks with varying node scales. The experimental results demonstrate that, for small-scale node tasks, the spatial encoder-decoder based on multilayer perceptrons achieves good accuracy and efficiency. In contrast, for large-scale node tasks, the spatial encoder-decoder based on convolutional neural networks exhibits superior performance.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"42 ","pages":"Article 100568"},"PeriodicalIF":4.2,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145364825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-09 | DOI: 10.1016/j.bdr.2025.100567
Yongpeng Yang, Zhenzhen Yang
In intelligent cities, traffic forecasting plays a significant role in intelligent transportation systems. Many methods that combine spectral graph neural networks and self-attention have been proposed. However, they still have some limitations for traffic forecasting: 1) The polynomial basis of traditional spectral graph neural networks (GNNs) is fixed, which limits their ability to learn the spatial dependency of traffic data. 2) Some GNNs ignore the dynamic dependency of traffic data. 3) Traditional self-attention suffers from limited perception of long-term information, time delay, and global information. These shortcomings pose a big challenge for traffic forecasting by limiting the ability to capture the spatial-temporal dependency and the dynamic and heterogeneous nature of traffic data. From this perspective, we propose an adaptive spectral GNN and frequency-enhanced self-attention (ASGFES) for traffic forecasting, which can effectively capture the spatial-temporal dependency and the dynamic and heterogeneous nature of traffic data. Specifically, we first introduce an adaptive spectral graph neural network (ASGNN) that effectively captures the spatial dependency by constructing an adaptive polynomial basis. In addition, two dynamic long- and short-range attentive graphs are fed into the ASGNN to emphasize dynamicity from the long- and short-range views. Secondly, we introduce a normalized self-attention with damped exponential moving average (NSADEMA). Specifically, the normalized self-attention (NSA) can capture the necessary expressivity to learn all-pair interactions without extra operations such as positional encodings and multi-head operations, and it can well capture the temporal dependency and heterogeneity of traffic data. In addition, the DEMA, which is equipped in the NSA, can enhance the perception of the inductive bias of traffic data in the time domain and makes the model aware of the time delay of traffic data. Thirdly, a linear frequency learner with time-series decomposition (LFLTD) is developed to enhance the ability to capture the temporal dependency and heterogeneity. Specifically, time-series decomposition (TSD) facilitates the analysis and forecasting of complex time series by capturing various hidden components such as the trend and seasonal components. Meanwhile, the linear frequency learner (LFL) can learn global dependencies and concentrate on the important parts of the frequency components with compact signal energy. Finally, extensive experiments on several public traffic datasets demonstrate that the proposed ASGFES achieves better performance than other traffic forecasting methods.
{"title":"Adaptive spectral GNN and frequency enhanced self-attention for traffic forecasting","authors":"Yongpeng Yang , Zhenzhen Yang","doi":"10.1016/j.bdr.2025.100567","DOIUrl":"10.1016/j.bdr.2025.100567","url":null,"abstract":"<div><div>In intelligent city, traffic forecasting has played a significant role in intelligent transportation system. Nowadays, many methods, which combine spectral graph neural network and self-attention, are proposed. However, they still have some limitations for traffic forecasting: 1) The polynomial basis of traditional spectral graph neural networks (GNN) is fixed, which limits their ability to learn spatial dependency of traffic data. 2) Some GNNs ignore the dynamic dependency of traffic data. 3) Traditional self-attention suffers from limited perception for long-term information, time delay, and global information. These defaults pose big challenge for traffic forecasting via limiting their ability of capturing spatial-temporal dependency, dynamic and heterogeneous nature in traffic data. From this perspective, we propose an adaptive spectral GNN and frequency enhanced self-attention (ASGFES) for traffic forecasting, which can effectively capture the spatial-temporal dependency, dynamic and heterogeneous nature in traffic data. Specifically, we first introduce an adaptive spectral graph neural network (ASGNN) for effectively capturing the spatial dependency via conducting adaptive polynomial basis. In addition, two dynamic long and short range attentive graphs are fed into the ASGNN for emphasizing the dynamicity in view of long and short range. Secondly, we introduce a normalized self-attention with damped exponential moving average (NSADEMA). Specifically, the normalized self-attention (NSA) can capture the necessary expressivity to learn all-pair interactions without the need for some extra operation such as positional encodings, multi-head operations, and so on. It can well obtain the temporal dependency and heterogeneity of traffic data. In addition, the DEMA, which is equipped into NSA, can enhance the perception for the inductive bias of traffic data in time domain. It can be aware of the time delay of traffic data. Thirdly, linear frequency learner with time-series decomposition (LFLTD) are developed for enhancing the ability of capturing the temporal dependency and heterogeneity. Specifically, time-series decomposition (TSD) facilitates the analysis and forecasting of complex time via capturing various hidden components such as the trend and seasonal components. Meanwhile, linear frequency learner (LFL) can learn global dependencies and concentrating on important part of frequency components with compact signal energy. At last, many experiments are performed on several public traffic datasets and demonstrate the proposed ASGFES can achieve better performance than other traffic forecasting methods.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"42 ","pages":"Article 100567"},"PeriodicalIF":4.2,"publicationDate":"2025-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145271154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-28 | DOI: 10.1016/j.bdr.2025.100556
Sobia Tariq Javed, Kashif Zafar, Irfan Younas
The rapid advancement of technology has led to the generation of big data. This vast and diverse data can uncover valuable patterns and yield promising results when effectively mined, processed, and analyzed. However, it also introduces the “curse of dimensionality,” which can negatively impact the performance of machine learning models. Feature Selection (FS) is a data preprocessing technique aimed at identifying the optimal feature set to enhance model efficiency and reduce processing time. Numerous metaheuristic wrapper-based FS techniques have been explored in the literature. However, a significant drawback of many of these algorithms is their dependence on centralized learning, where the global best solution drives the search direction. This centralized approach is risky, as any error by the global best can hinder the exploration and exploitation of other potential areas, leading to inaccuracies in discovering the true global optimum. In this paper, the binary variant of a novel decentralized metaheuristic, the Kids Learning Optimization Algorithm (KLO), called the Binary Kids Learning Optimization Algorithm (BKLO), is proposed for optimal feature selection for classification in wrapper mode. The continuous solutions of KLO are converted to binary space by using a transfer function. A comparison is provided between two transfer functions: the hyperbolic tangent (V-shaped) and the sigmoidal (S-shaped) transfer functions. BKLO is compared with seven state-of-the-art algorithms. The performance of the algorithms is evaluated and compared using several assessment indicators over fifteen benchmark datasets with a wide range of dimensions (small, medium, and large) from the University of California Irvine (UCI) repository and Arizona State University. The superiority of BKLO over the other competing algorithms in reducing the number of features while increasing classification accuracy is demonstrated through experiments and Friedman's Mean Rank (FMR) statistical tests.
{"title":"A decentralized metaheuristic approach to feature selection inspired by social interactions within a societal framework, for handling datasets of diverse sizes","authors":"Sobia Tariq Javed , Kashif Zafar , Irfan Younas","doi":"10.1016/j.bdr.2025.100556","DOIUrl":"10.1016/j.bdr.2025.100556","url":null,"abstract":"<div><div>The rapid advancement of technology has led to the generation of big data. This vast and diverse data can uncover valuable patterns and yield promising results when effectively mined, processed, and analyzed. However, it also introduces the “curse of dimensionality,” which can negatively impact the performance of machine learning models. Feature Selection (FS) is a data preprocessing technique aimed at identifying the optimal feature set to enhance model efficiency and reduce processing time. Numerous metaheuristic wrapper-based FS techniques have been explored in the literature. However, a significant drawback of many of these algorithms is their dependence on centralized learning, where the global best solution drives the search direction. This centralized approach is risky, as any error by the global best can hinder the exploration and exploitation of other potential areas, leading to inaccuracies in discovering the true global optimum. In this paper, the binary variant of a novel decentralized metaheuristic Kids Learning Optimization Algorithm (KLO) called <strong>Binary Kids Learning Optimization Algorithm (BKLO)</strong> is proposed for optimal feature selection for classification purposes in wrapper mode. The continuous solutions of KLO are converted to binary space by using the transfer function. A comparison is provided between the two transfer functions: hyperbolic tan (V-shaped) and the Sigmoidal (S-shaped) transfer functions. BKLO is compared with seven state-of-the-art algorithms. The performance of algorithms is evaluated and compared using several assessment indicators over fifteen benchmark datasets with a wide range of dimensions (small, medium, and large) from the University of California Irvine (UCI) repository and Arizona State University. The superiority of BKLO in reducing the number of features with increased classification accuracy over the other competing algorithms is demonstrated through the experiments and Friedman's Mean Rank (FMR) statistical tests.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"41 ","pages":"Article 100556"},"PeriodicalIF":4.2,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144903932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-20 | DOI: 10.1016/j.bdr.2025.100554
Keren Li, Wenqiang Zhang, Dandan Xiao, Peng Hou, Shuai Yan, Yang Wang, Xuerui Mao
To address the storage challenges stemming from large volumes of heterogeneous data in wind farms, we propose a data compression technique based on tensor train decomposition (TTD). Initially, we establish a tensor-based processing model to standardize the heterogeneous data originating from wind farms, which includes both structured SCADA (supervisory control and data acquisition) data and unstructured video and picture data. Subsequently, we introduce a TTD-based method designed to compress the heterogeneous data generated in wind farms while preserving the inherent spatial eigenstructure of the data. Finally, we validate the efficacy of the proposed method in alleviating data storage challenges by utilizing authentic wind farm datasets. Comparative analysis reveals that the TTD-based method outperforms previously proposed compression techniques, specifically the canonical polyadic (CP) and Tucker methods.
{"title":"Compression of big data collected in wind farm based on tensor train decomposition","authors":"Keren Li , Wenqiang Zhang , Dandan Xiao , Peng Hou , Shuai Yan , Yang Wang , Xuerui Mao","doi":"10.1016/j.bdr.2025.100554","DOIUrl":"10.1016/j.bdr.2025.100554","url":null,"abstract":"<div><div>To address the storage challenges stemming from large volumes of heterogeneous data in wind farms, we propose a data compression technique based on tensor train decomposition (TTD). Initially, we establish a tensor-based processing model to standardize the heterogeneous data originating from wind farms, which includes both structured SCADA (supervisory control and data acquisition) data and unstructured video and picture data. Subsequently, we introduce a TTD-based method designed to compress the heterogeneous data generated in wind farms while preserving the inherent spatial eigenstructure of the data. Finally, we validate the efficacy of the proposed method in alleviating data storage challenges by utilizing authentic wind farm datasets. Comparative analysis reveals that the TTD-based method outperforms previously proposed compression techniques, specifically the canonical polyadic (CP) and Tucker methods.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"41 ","pages":"Article 100554"},"PeriodicalIF":4.2,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144886090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-19 | DOI: 10.1016/j.bdr.2025.100555
Hesamodin Mohammadian, Griffin Higgins, Samuel Ansong, Roozbeh Razavi-Far, Ali A. Ghorbani
Recently, Control Flow Graphs and Function Call Graphs have gained attention in the malware detection task due to their ability to represent the complex structural and functional behavior of programs. To better utilize these representations in malware detection and improve detection performance, they have been paired with Graph Neural Networks (GNNs). However, the sheer size and complexity of these graph representations pose a significant challenge for researchers. At the same time, the simple binary classification provided by GNN models is insufficient for malware analysts. To address these challenges, this paper integrates novel graph reduction techniques and GNN explainability into a malware detection framework to enhance both efficiency and interpretability. Through our extensive evaluation, we demonstrate that the proposed graph reduction technique significantly reduces the size and complexity of the input graphs while maintaining detection performance. Furthermore, the important subgraphs extracted using the GNNExplainer provide better insights into the model's decisions and help security experts with their further analysis.
{"title":"Explainable malware detection through integrated graph reduction and learning techniques","authors":"Hesamodin Mohammadian, Griffin Higgins, Samuel Ansong, Roozbeh Razavi-Far, Ali A. Ghorbani","doi":"10.1016/j.bdr.2025.100555","DOIUrl":"10.1016/j.bdr.2025.100555","url":null,"abstract":"<div><div>Recently, Control Flow Graphs and Function Call Graphs have gain attention in malware detection task due to their ability in representation the complex structural and functional behavior of programs. To better utilize these representations in malware detection and improve the detection performance, they have been paired with Graph Neural Networks (GNNs). However, the sheer size and complexity of these graph representation poses a significant challenge for researchers. At the same time, a simple binary classification provided by the GNN models is insufficient for malware analysts. To address these challenges, this paper integrates novel graph reduction techniques and GNN explainability in to a malware detection framework to enhance both efficiency and interpretability. Through our extensive evolution, we demonstrate that the proposed graph reduction technique significantly reduces the size and complexity of the input graphs, while maintaining the detection performance. Furthermore, the extracted important subgraphs using the GNNExplainer, provide better insights about the model's decision and help security experts with their further analysis.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"41 ","pages":"Article 100555"},"PeriodicalIF":4.2,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144863267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-18 | DOI: 10.1016/j.bdr.2025.100558
Yong Li, Jingpeng Wu, Zhongying Zhang
Link prediction is a paradigmatic problem with tremendous real-world applications in network science; it aims to infer missing links or future links based on currently observed partial nodes and links. However, conventional link prediction models based on network structure have relatively low prediction accuracy and lack universality and scalability, while the performance of link prediction based on machine learning and artificial features is greatly influenced by subjective consciousness. Although graph embedding learning (GEL) models can avoid these shortcomings, they still pose some challenges. Because GEL models are generally based on random walks and graph neural networks (GNNs), their prediction accuracy is relatively limited, making them unsuitable for revealing hidden information in node-featureless networks. To address these challenges, we present NGLinker, a new link prediction model based on Node2vec and GraphSage, which can reconcile performance and accuracy in a node-featureless network. Rather than learning node features with label information, NGLinker depends only on the local network structure. Quantitatively, we observe superior prediction accuracy of NGLinker and lab test imputations compared to state-of-the-art models, which strongly supports that using NGLinker for prediction on three public networks and one private network is feasible and effective. NGLinker not only achieves strong prediction accuracy in terms of precision and area under the receiver operating characteristic curve (AUC) but also acquires strong universality and scalability. The NGLinker model extends the application of GNNs to node-featureless networks.
{"title":"NGLinker: Link prediction for node featureless networks","authors":"Yong Li , Jingpeng Wu , Zhongying Zhang","doi":"10.1016/j.bdr.2025.100558","DOIUrl":"10.1016/j.bdr.2025.100558","url":null,"abstract":"<div><div>Link prediction is a paradigmatic problem with tremendous real-world applications in network science, which aims to infer missing links or future links based on currently observed partial nodes and links. However, conventional link prediction models are based on network structure, with relatively low prediction accuracy and lack universality and scalability. The performance of link prediction based on machine learning and artificial features is greatly influenced by subjective consciousness. Although graph embedding learning (GEL) models can avoid these shortcomings, it still poses some challenges. Because GEL models are generally based on random walks and graph neural networks (GNNs), their prediction accuracy is relatively ineffective, making them unsuitable for revealing hidden information in node featureless networks. To address these challenges, we present NGLinker, a new link prediction model based on Node2vec and GraphSage, which can reconcile the performance and accuracy in a node featureless network. Rather than learning node features with label information, NGLinker depends only on the local network structure. Quantitatively, we observe superior prediction accuracy of NGLinker and lab test imputations compared to the state-of-the-art models, which strongly supports that using NGLinker to predict three public networks and one private network and then conduct prediction results is feasible and effective. The NGLinker can not only achieve prediction accuracy in terms of precision and area under the receiver operating characteristic curve (AUC) but also acquire strong universality and scalability. The NGLinker model enlarges the application of the GNNs to node featureless networks.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"41 ","pages":"Article 100558"},"PeriodicalIF":4.2,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144863266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-13 | DOI: 10.1016/j.bdr.2025.100557
Luping Zhi, Wanmin Wang
Detecting fraudulent transactions in structured financial data presents significant challenges due to multimodal, non-Gaussian continuous variables, mixed-type features, and severe class imbalance. To address these issues, we propose an Embedding-Aware Conditional Generative Adversarial Network (EAC-GAN), which incorporates trainable label embeddings into both the generator and discriminator to enable semantically controlled synthesis of minority-class samples. In addition to adversarial training, EAC-GAN introduces an auxiliary classification objective, forming a joint optimization strategy that improves the fidelity and class consistency of generated data, especially for underrepresented classes. Experiments conducted on a real-world credit card dataset demonstrate that EAC-GAN achieves stable convergence even with limited labeled data. When combined with LightGBM classifiers, the synthetic samples generated by EAC-GAN significantly enhance fraud detection performance, yielding a precision of 96.8%, an AUC of 96.38%, an AUPRC of 83.89%, and an MCC of 88.94%. Furthermore, dimensionality reduction using Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) reveals that the generated samples closely align with the real data distribution and exhibit clear class separability in the latent space. These results underscore the effectiveness of EAC-GAN in synthesizing high-quality minority-class samples and improving downstream fraud detection, outperforming traditional oversampling techniques and baseline generative models.
{"title":"Research on modeling of the imbalanced fraudulent transaction detection problem based on embedding-aware conditional GAN","authors":"Luping Zhi , Wanmin Wang","doi":"10.1016/j.bdr.2025.100557","DOIUrl":"10.1016/j.bdr.2025.100557","url":null,"abstract":"<div><div>Detecting fraudulent transactions in structured financial data presents significant challenges due to multimodal, non-Gaussian continuous variables, mixed-type features, and severe class imbalance. To address these issues, we propose an Embedding-Aware Conditional Generative Adversarial Network (EAC-GAN), which incorporates trainable label embeddings into both the generator and discriminator to enable semantically controlled synthesis of minority-class samples. In addition to adversarial training, EAC-GAN introduces an auxiliary classification objective, forming a joint optimization strategy that improves the fidelity and class consistency of generated data, especially for underrepresented classes. Experiments conducted on a real-world credit card dataset demonstrate that EAC-GAN achieves stable convergence even with limited labeled data. When combined with LightGBM classifiers, the synthetic samples generated by EAC-GAN significantly enhance fraud detection performance, yielding a precision of 96.8%, an AUC of 96.38%, an AUPRC of 83.89%, and an MCC of 88.94%. Furthermore, dimensionality reduction using Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) reveals that the generated samples closely align with the real data distribution and exhibit clear class separability in the latent space. These results underscore the effectiveness of EAC-GAN in synthesizing high-quality minority-class samples and improving downstream fraud detection, outperforming traditional oversampling techniques and baseline generative models.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"41 ","pages":"Article 100557"},"PeriodicalIF":4.2,"publicationDate":"2025-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144863265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-09 | DOI: 10.1016/j.bdr.2025.100553
Zheng Fang, Toby Cai
Modeling stock returns has often relied on multivariate time series analysis, and constructing an accurate model remains a challenging goal for both market investors and academic researchers. Stock return prediction typically involves multiple variables and a combination of long-term and short-term time series patterns. In this paper, we propose a new deep learning network, named DLS-TS-Net, to model stock returns and address this challenge. We apply DLS-TS-Net in multivariate time series forecasting. The network integrates a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) units, and Gated Recurrent Units (GRUs). DLS-TS-Net overcomes LSTM's insensitivity to linear components in stock market forecasting by incorporating a traditional autoregressive model. Experimental results demonstrate that DLS-TS-Net excels at capturing long-term trends in multivariate factors and short-term fluctuations in the stock market, outperforming traditional time series and machine learning models. Additionally, when combined with the investment strategies proposed in this paper, DLS-TS-Net shows superior performance in managing risk during extreme events.
{"title":"Deep neural network modeling for financial time series analysis","authors":"Zheng Fang , Toby Cai","doi":"10.1016/j.bdr.2025.100553","DOIUrl":"10.1016/j.bdr.2025.100553","url":null,"abstract":"<div><div>Modeling stock returns has often relied on multivariate time series analysis, and constructing an accurate model remains a challenging goal for both market investors and academic researchers. Stock return prediction typically involves multiple variables and a combination of long-term and short-term time series patterns. In this paper, we propose a new deep learning network, named DLS-TS-Net, to model stock returns and address this challenge. We apply DLS-TS-Net in multivariate time series forecasting. The network integrates a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) units, and Gated Recurrent Units (GRUs). DLS-TS-Net overcomes LSTM's insensitivity to linear components in stock market forecasting by incorporating a traditional autoregressive model. Experimental results demonstrate that DLS-TS-Net excels at capturing long-term trends in multivariate factors and short-term fluctuations in the stock market, outperforming traditional time series and machine learning models. Additionally, when combined with the investment strategies proposed in this paper, DLS-TS-Net shows superior performance in managing risk during extreme events</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"41 ","pages":"Article 100553"},"PeriodicalIF":3.5,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144263987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}