E-word of mouth in sales volume forecasting: Toyota Camry case study
Domenica Fioredistella Iezzi, Roberto Monte
Pub Date: 2025-05-15. DOI: 10.1016/j.bdr.2025.100542. Big Data Research, Vol. 41, Article 100542.
In recent years, electronic word of mouth has become a significant factor in purchasing decisions, with consumers' sentiments playing a crucial role in shaping the sales of products and services.
This paper introduces a novel approach to sales forecasting that accounts for consumers' sentiments toward goods or services by combining the sales volume time series with a quantitative proxy of the unobservable true sentiment. Numerous studies have explored various methods to capture sentiment and accurately predict sales. We integrate an estimated sentiment signal, variously built via lexicon-based, machine-learning, and deep-learning approaches, into a multivariate autoregressive state space (MARSS) model. We test the model on a dataset of 163,000 tweets about the Toyota Camry covering the period from June 2009 to December 2022, together with sales volumes in the US market over the same timeframe.
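For orientation, a generic MARSS specification treats the unobservable true sentiment as a latent state and the estimated sentiment signal as its noisy observation. The notation below is a standard textbook form, assumed here; the paper's exact parameterization may differ:

```latex
\begin{aligned}
\mathbf{x}_t &= \mathbf{B}\,\mathbf{x}_{t-1} + \mathbf{w}_t, &\quad \mathbf{w}_t &\sim \mathcal{N}(\mathbf{0},\,\mathbf{Q}),\\
\mathbf{y}_t &= \mathbf{Z}\,\mathbf{x}_t + \mathbf{v}_t, &\quad \mathbf{v}_t &\sim \mathcal{N}(\mathbf{0},\,\mathbf{R}),
\end{aligned}
```

where the state vector collects the latent sales level and true sentiment, and the observation vector stacks recorded sales volumes and the lexicon-, machine-learning-, or deep-learning-based sentiment proxy.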
{"title":"E-word of mouth in sales volume forecasting: Toyota Camry case study","authors":"Domenica Fioredistella Iezzi , Roberto Monte","doi":"10.1016/j.bdr.2025.100542","DOIUrl":"10.1016/j.bdr.2025.100542","url":null,"abstract":"<div><div>In recent years, electronic word of mouth has become a significant factor in purchasing decisions, with consumers' sentiments playing a crucial role in shaping the sales of products and services.</div><div>This paper introduces a novel approach to sales forecasting that addresses consumers' sentiments toward goods or services by combining the sales volume time series with a quantitative proxy of the unobservable true sentiment. Numerous studies have explored various methods to capture sentiment and accurately predict sales. We have integrated an estimated sentiment signal, variously built via lexicon-based, machine-learning, and deep-learning approaches, into a multivariate autoregressive state space (MARSS) model. We have tested our model on a dataset of 163,000 tweets about the Toyota Camry, covering the period from June 2009 to December 2022 and sales volumes in the US market over the same timeframe.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"41 ","pages":"Article 100542"},"PeriodicalIF":3.5,"publicationDate":"2025-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144106944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring the impact of high schools, socioeconomic factors, and degree programs on higher education success in Italy
Cristian Usala, Isabella Sulis, Mariano Porcu
Pub Date: 2025-05-15. DOI: 10.1016/j.bdr.2025.100539. Big Data Research, Vol. 41, Article 100539.
This study investigates the determinants of tertiary education success in Italy, focusing on students' outcomes between the first and second years. We use population data on students enrolled between 2015 and 2019, integrating information on high school environments and degree program characteristics. This rich dataset is exploited with a two-step approach: the first step defines indicators of high school quality and degree program difficulty; the second estimates a multinomial logit model to assess the determinants of students' probability of being classified as regular, churner, at risk of dropout, or dropout. Data on the 2019 cohort are further investigated by exploiting additional information on students' socioeconomic backgrounds and schools' self-assessed effectiveness evaluations. Results indicate that students' high school backgrounds, socioeconomic conditions, and post-graduation prospects, in terms of net wages and occupation rates of graduates in the chosen degree program, significantly influence academic success and persistence. Overall, the results offer a comprehensive view of the determinants of university success, with specific patterns observed across the different student categories.
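As a sketch of the second step, a multinomial logit over student-level covariates can be fitted as below; the feature set and the randomly generated data are illustrative stand-ins, not the paper's population data:

```python
# Minimal multinomial logit sketch; feature names are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.normal(size=n),  # high school quality indicator (step-1 output)
    rng.normal(size=n),  # degree program difficulty indicator (step-1 output)
    rng.normal(size=n),  # socioeconomic background proxy
])
# Outcome categories from the paper: regular, churner, at risk, dropout.
y = rng.integers(0, 4, size=n)

# With the default lbfgs solver, sklearn fits a multinomial logit natively.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
print(model.predict_proba(X[:5]))  # per-category membership probabilities
```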
{"title":"Exploring the impact of high schools, socioeconomic factors, and degree programs on higher education success in Italy","authors":"Cristian Usala, Isabella Sulis, Mariano Porcu","doi":"10.1016/j.bdr.2025.100539","DOIUrl":"10.1016/j.bdr.2025.100539","url":null,"abstract":"<div><div>This study investigates the determinants of tertiary education success in Italy, focusing on students' outcomes between the first and second years. We use population data of students enrolled between 2015 and 2019, integrating information on high school environments and degree program characteristics. This rich dataset has been exploited with a two-step approach: the first step defines indicators for high school quality and degree program difficulty; the second estimates a multinomial logit to assess the determinants of students' probability of being classified as regulars, churners, at risk of dropout, and dropouts. Data regarding the 2019 cohort have been further investigated by exploiting the additional information on students' socioeconomic backgrounds and schools' self-assessed effectiveness evaluations. Results indicate that students' high school backgrounds, socioeconomic conditions, and post-graduation prospects in terms of net wages and occupation rates of graduates in the chosen degree program significantly influence academic success and students' academic persistence. Overall, the results offer a comprehensive view of the determinants of university success, with specific patterns observed across the different student categories.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"41 ","pages":"Article 100539"},"PeriodicalIF":3.5,"publicationDate":"2025-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144134568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Business digitalization in Italy: A comprehensive analysis using supplementary fuzzy set approach
Ilaria Benedetti, Federico Crescenzi, Tiziana Laureti, Niccolò Salvini
Pub Date: 2025-05-15. DOI: 10.1016/j.bdr.2025.100538. Big Data Research, Vol. 41, Article 100538.
In an era where digital technologies such as AI, cloud computing, and IoT are reshaping global business dynamics, the digital transformation of enterprises has become a pivotal factor for maintaining competitive advantage. This paper provides an in-depth analysis of the digitalization process among Italian firms, leveraging data from the ISTAT ICT survey. Using a fuzzy set approach, we develop a refined index to measure technological deprivation across multiple dimensions, providing a detailed understanding of how digitalization is adopted at the firm level. The results indicate a moderate level of technological development among firms. The dimension related to online sales emerges as the most underdeveloped, highlighting it as a critical area for improvement for Italian companies and underscoring the need for targeted policy interventions to bridge these digital gaps. Moreover, the analysis reveals significant disparities across sectors, geographic areas, and firm sizes, with smaller enterprises and those in certain regions exhibiting lower levels of digital adoption. Our study underscores the utility of the fuzzy set methodology for analyzing high-dimensional big data and provides actionable insights for enhancing digital adoption among firms in Italy.
Big data analytics for smart home energy management system based on IOMT using AHP and WASPAS
Jingze Zhou, Salem Alkhalaf, S. Abdel-Khalek, Shah Nazir
Pub Date: 2025-05-10. DOI: 10.1016/j.bdr.2025.100534. Big Data Research, Vol. 41, Article 100534.
The convergence of edge computing and 5G networking offers an innovative way to meet the energy efficiency and low latency requirements of medical data processing, especially from the perspective of the Internet of Medical Things (IoMT). Together, these technologies allow quick and effective handling of the enormous volumes of medical data produced by IoMT devices in smart healthcare systems. The IoMT is bringing cutting-edge technologies, social benefits, and economic advantages that are transforming modern healthcare. Digital healthcare is also being reshaped by machine learning, which uses sophisticated algorithms to forecast patients' health status efficiently: such approaches predict disease onset and hospital readmissions and support treatment customization by analyzing large medical datasets. However, robust data security and high forecast accuracy remain open challenges: the quality and variety of the training data are key to accurate predictions, while strict encryption, safe storage, and regulatory compliance are necessary for data security. Drawing on significant components identified in existing research, the present study seeks to determine the most widely shared features and offers a systematic approach for assessing them using the Analytic Hierarchy Process (AHP) and the Weighted Aggregated Sum Product Assessment (WASPAS) method. These approaches support efficient big data analytics for smart home energy management systems based on the IoMT.
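To make the two techniques concrete, the sketch below derives criteria weights from a hypothetical AHP pairwise comparison matrix and ranks alternatives with the standard WASPAS combination of weighted-sum and weighted-product scores; all matrix values are illustrative, not taken from the paper:

```python
# AHP weighting plus WASPAS ranking; all numbers are illustrative.
import numpy as np

# AHP: pairwise comparison matrix for 3 hypothetical criteria.
A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])
w = A.prod(axis=1) ** (1 / A.shape[1])   # geometric-mean approximation
w = w / w.sum()                          # criteria weights

# Decision matrix: rows = alternatives (features), cols = criteria.
X = np.array([[0.7, 0.9, 0.4],
              [0.8, 0.5, 0.6],
              [0.6, 0.7, 0.9]])
Xn = X / X.max(axis=0)                   # normalize benefit criteria

lam = 0.5                                # WASPAS mixing parameter
wsm = (Xn * w).sum(axis=1)               # weighted sum model
wpm = (Xn ** w).prod(axis=1)             # weighted product model
waspas = lam * wsm + (1 - lam) * wpm
print(np.argsort(-waspas))               # ranking of alternatives
```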
{"title":"Big data analytics for smart home energy management system based on IOMT using AHP and WASPAS","authors":"Jingze Zhou , Salem Alkhalaf , S. Abdel-Khalek , Shah Nazir","doi":"10.1016/j.bdr.2025.100534","DOIUrl":"10.1016/j.bdr.2025.100534","url":null,"abstract":"<div><div>The convergence of edge computing and 5G network speed provides an innovative way to address the energy efficiency and low latency requirements in medical data processing, especially from the perspective of the Internet of Medical Things (IoMT). Together, these technologies allow for the quick and effective handling of the enormous volumes of medical data produced by different IoMT devices in the context of smart healthcare systems. The IoMT is bringing cutting-edge technologies, social benefits, and economic advantages to transform modern healthcare systems entirely. Digital healthcare is transforming due to machine learning, which uses sophisticated algorithms to forecast patients’ health status efficiently. These approaches predict the onset of disease, hospital readmissions, and treatment customization by analyzing large medical datasets. Strong data security and good forecast accuracy are still issues. The quality and variety of training data are key factors in making accurate predictions, and strict encryption, safe storage, and regulatory compliance are necessary for data security. By including various significant components from existing research, the current study seeks to determine the most collective features. The goal of the study is to offer a systematic approach for assessing these features identified by using the approaches of the AHP and WASPAS. These approaches are effective for efficient big data analytics in the context of smart home energy management system based on IOMT.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"41 ","pages":"Article 100534"},"PeriodicalIF":3.5,"publicationDate":"2025-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144170001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Saving food surplus and developing new business models: Exploring the potential of ‘Too Good To Go’ at territorial level using web-scraped data
Mengting Yu, Luca Secondi, Tiziana Laureti, Luigi Palumbo
Pub Date: 2025-05-10. DOI: 10.1016/j.bdr.2025.100536. Big Data Research, Vol. 40, Article 100536.
Food surplus that is fit for consumption is often excluded from the consumption loop for commercial reasons, leading to wasted food, nutrients, resources, and costs. Digital innovations with diverse business models aim to combat this through food redistribution. However, it is critical to assess their effectiveness from stakeholder and consumer perspectives; meanwhile, new research is focusing on the value of these business models.
This study employs web scraping to collect multi-dimensional data on Too Good To Go from two Italian cities. The analysis confirms the platform's positive contribution to food surplus redistribution, with economic benefits, despite a weaker presence of certain food establishment types and a lack of social motivation among consumers. Furthermore, strong business-customer relationships can be established when businesses commit to reducing food waste and communicate effectively with their customers through the platform.
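A minimal sketch of the kind of collection described is shown below; the URL and CSS selectors are hypothetical placeholders rather than the platform's real page structure, and any real collection must respect the site's terms of service:

```python
# Illustrative web-scraping sketch; URL and selectors are placeholders.
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/toogoodtogo/listings?city=Rome"  # hypothetical

resp = requests.get(URL, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

records = []
for card in soup.select("div.listing-card"):      # assumed selector
    records.append({
        "store": card.select_one(".store-name").get_text(strip=True),
        "category": card.select_one(".store-type").get_text(strip=True),
        "price": card.select_one(".price").get_text(strip=True),
        "rating": card.select_one(".rating").get_text(strip=True),
    })
print(len(records), "listings collected")
```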
{"title":"Saving food surplus and developing new business models: Exploring the potential of ‘Too Good To Go’ at territorial level using web-scraped data","authors":"Mengting Yu, Luca Secondi, Tiziana Laureti, Luigi Palumbo","doi":"10.1016/j.bdr.2025.100536","DOIUrl":"10.1016/j.bdr.2025.100536","url":null,"abstract":"<div><div>Food surplus, fit for consumption, is often excluded from the consumption loop for commercial reasons, leading to wasted food, nutrients, resources, and costs. Digital innovations with diverse business models aim to combat this through food redistribution. However, it is critical to assess their effectiveness from stakeholder and consumer perspectives, meanwhile, new research focuses on the value of these business models.</div><div>This study employs web scraping technology to collect multi-dimensional data from two Italian cities on <em>Too Good To Go</em>. The analysis results confirm its positive contribution to food surplus redistribution with economic benefits, despite a weaker presence of certain food establishment types and a lack of social motivation among consumers. Furthermore, strong business-customer relationships can be established when businesses commit to reducing food waste and effectively communicate with their customers using the platform.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100536"},"PeriodicalIF":3.5,"publicationDate":"2025-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143947341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A novel study of kernel graph regularized semi-non-negative matrix factorization with orthogonal subspace for clustering
Yasong Chen, Wen Li, Junjian Zhao
Pub Date: 2025-04-22. DOI: 10.1016/j.bdr.2025.100531. Big Data Research, Vol. 40, Article 100531.
As a nonlinear extension of Non-negative Matrix Factorization (NMF), Kernel Non-negative Matrix Factorization (KNMF) has demonstrated greater effectiveness in revealing latent features in raw data. Building on this, this paper introduces kernel theory and combines the advantages of semi-nonnegative constraints, graph regularization, and orthogonal subspace constraints to propose a novel model: Kernel Graph Regularized Semi-Non-negative Matrix Factorization with Orthogonal Subspaces and Auxiliary Variables (semi-KGNMFOSV). The model introduces auxiliary variables and reformulates the optimization problem, overcoming the convergence-proof challenges typically associated with orthogonal subspace-constrained methods. Furthermore, it uses kernel methods to capture complex nonlinear structures in the data. The semi-nonnegative constraint, together with orthogonal subspace constraints incorporating auxiliary variables, improves optimization efficiency, while graph regularization preserves the local geometric structure of the data. We develop an efficient optimization algorithm for the proposed model and conduct extensive experiments on multiple real-world datasets, additionally investigating the impact of three different initialization strategies on performance. Experimental results demonstrate that, compared with classical and state-of-the-art methods, the proposed model performs better under all three initialization strategies.
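For background, the classical graph-regularized NMF objective that such models build on can be written as follows (standard notation, not the paper's exact formulation):

```latex
\min_{\mathbf{U}\ge 0,\ \mathbf{V}\ge 0}\;
\left\| \mathbf{X} - \mathbf{U}\mathbf{V}^{\top} \right\|_F^2
+ \lambda\,\mathrm{Tr}\!\left( \mathbf{V}^{\top}\mathbf{L}\,\mathbf{V} \right),
```

where L is the graph Laplacian of a nearest-neighbor graph over the samples. The kernel variant replaces X with a feature map evaluated through a kernel matrix, the semi-nonnegative relaxation drops the sign constraint on one factor, and the orthogonal subspace constraint additionally imposes V^T V = I; the paper's auxiliary variables are introduced on top of these ingredients to make the optimization and its convergence analysis tractable.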
{"title":"A novel study of kernel graph regularized semi-non-negative matrix factorization with orthogonal subspace for clustering","authors":"Yasong Chen , Wen Li, Junjian Zhao","doi":"10.1016/j.bdr.2025.100531","DOIUrl":"10.1016/j.bdr.2025.100531","url":null,"abstract":"<div><div>As a nonlinear extension of Non-negative Matrix Factorization (NMF), Kernel Non-negative Matrix Factorization (KNMF) has demonstrated greater effectiveness in revealing latent features from raw data. Building on this, this paper introduces kernel theory and effectively combines the advantages of semi-nonnegative constraints, graph regularization, and orthogonal subspace constraints to propose a novel model-Kernel Graph Regularized Semi-Negative Matrix Factorization with Orthogonal Subspaces and Auxiliary Variables (semi-KGNMFOSV). This model introduces auxiliary variables and reformulates the optimization problem, successfully overcoming the convergence proof challenges typically associated with orthogonal subspace-constrained methods. Furthermore, the model utilizes kernel methods to effectively capture complex nonlinear structures in the data. The semi-nonnegative constraint, along with orthogonal subspace constraints incorporating auxiliary variables, enhances optimization efficiency, while graph regularization preserves the local geometric structure of the data. We develop an efficient optimization algorithm to solve the proposed model and conduct extensive experiments on multiple real-world datasets. Additionally, we investigate the impact of three different initialization strategies on the performance of the proposed algorithm. Experimental results demonstrate that, compared to classical and state-of-the-art methods, the proposed model exhibits superior performance across all three initialization strategies.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100531"},"PeriodicalIF":3.5,"publicationDate":"2025-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143863357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hourglass pattern matching for deep aware neural network text recommendation model
Li Gao, Hongjun Li, Qingkui Chen, Dunlu Peng
Pub Date: 2025-04-17. DOI: 10.1016/j.bdr.2025.100532. Big Data Research, Vol. 40, Article 100532.
In recent years, with the rapid development of deep learning, big data mining, and natural language processing (NLP) technologies, the application of NLP in recommendation systems has attracted significant attention. However, current text recommendation systems still face challenges in handling word distribution assumptions, preprocessing design, network inference models, and text perception technologies. Traditional RNN layers often suffer from exploding or vanishing gradients, which hinders their ability to handle long-term dependencies and reverse text inference over long texts. This paper therefore proposes a new depth-aware neural network recommendation model with an hourglass-shaped structure (Hourglass Deep-aware neural network Recommendation Model, HDARM), consisting of three parts. The top of the hourglass processes word embeddings from a fine-tuned BERT model as the word distribution assumption, then combines a bidirectional LSTM with Transformer layers to learn critical information. The middle of the hourglass retains key features of the network outputs through CNN layers combined with pooling layers, extracting and enhancing critical information from user text. The bottom of the hourglass avoids a decline in generalization performance through deep neural network layers. Finally, the model performs pattern matching between text vectors and word embeddings, recommending texts based on their relevance. In experiments, the model improved MSE and NDCG@10 by 8.74% and 10.89%, respectively, over the best baseline model.
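A schematic sketch of this hourglass stacking (BERT-style embeddings, then BiLSTM plus Transformer, then CNN plus pooling, then deep feed-forward layers) is given below; all layer sizes are assumptions, and this is not the authors' exact HDARM:

```python
# Schematic hourglass stack; dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class HourglassSketch(nn.Module):
    def __init__(self, emb_dim=768, hidden=128, n_classes=2):
        super().__init__()
        # Top: sequence encoder over pretrained text embeddings.
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        enc_layer = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=4,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=1)
        # Middle: CNN + pooling distill key features (the "waist").
        self.conv = nn.Conv1d(2 * hidden, hidden, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)
        # Bottom: deep feed-forward layers widening back out.
        self.mlp = nn.Sequential(nn.Linear(hidden, 2 * hidden), nn.ReLU(),
                                 nn.Linear(2 * hidden, n_classes))

    def forward(self, x):                 # x: (batch, seq_len, emb_dim)
        h, _ = self.lstm(x)               # (batch, seq_len, 2*hidden)
        h = self.transformer(h)
        h = self.conv(h.transpose(1, 2))  # (batch, hidden, seq_len)
        h = self.pool(h).squeeze(-1)      # (batch, hidden)
        return self.mlp(h)

logits = HourglassSketch()(torch.randn(4, 32, 768))
print(logits.shape)  # torch.Size([4, 2])
```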
{"title":"Hourglass pattern matching for deep aware neural network text recommendation model","authors":"Li Gao, Hongjun Li, Qingkui Chen, Dunlu Peng","doi":"10.1016/j.bdr.2025.100532","DOIUrl":"10.1016/j.bdr.2025.100532","url":null,"abstract":"<div><div>In recent years, with the rapid development of deep learning, big data mining, and natural language processing (NLP) technologies, the application of NLP in the field of recommendation systems has attracted significant attention. However, current text recommendation systems still face challenges in handling word distribution assumptions, preprocessing design, network inference models, and text perception technologies. Traditional RNN neural network layers often encounter issues such as gradient explosion or vanishing gradients, which hinder their ability to effectively handle long-term dependencies and reverse text inference among long texts. Therefore, this paper proposes a new type of depth-aware neural network recommendation model (Hourglass Deep-aware neural network Recommendation Model, HDARM), whose structure presents an hourglass shape. This model consists of three parts: The top of the hourglass uses Word Embedding for input through Fine-tune Bert to process text embeddings as word distribution assumptions, followed by utilizing bidirectional LSTM to integrate Transformer models for learning critical information. The middle of the hourglass retains key features of network outputs through CNN layers, which are combined with pooling layers to extract and enhance critical information from user text. The bottom of the hourglass avoids a decline in generalization performance through deep neural network layers. Finally, the model performs pattern matching between text vectors and word embeddings, recommending texts based on their relevance. In experiments, this model improved metrics like MSE and NDCG@10 by 8.74 % and 10.89 % respectively compared to the optimal baseline model.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100532"},"PeriodicalIF":3.5,"publicationDate":"2025-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143923599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A decision tree algorithm based on adaptive entropy of feature value importance
Shaobo Deng, Weili Yuan, Sujie Guan, Xing Lin, Zemin Liao, Min Li
Pub Date: 2025-04-14. DOI: 10.1016/j.bdr.2025.100530. Big Data Research, Vol. 40, Article 100530.
Constructing an optimal decision tree remains a challenging task. Existing algorithms often utilize power coefficient methods or standardization techniques to weight the entropy value; however, these approaches do not sufficiently account for the importance of attributes. This paper introduces an Adaptive Entropy Decision Tree (EWDT) algorithm, which leverages eigenvalue importance and integrates singular value decomposition into the calculation of entropy values. Experimental results demonstrate that the proposed algorithm outperforms other decision tree algorithms in terms of accuracy, precision, recall, and F1-score.
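One plausible reading of this idea, sketched below under our own assumptions rather than as the authors' EWDT algorithm, is to derive per-feature importance from an SVD of the data matrix and scale each candidate split's information gain by it:

```python
# Hedged illustration: SVD-derived feature importance weighting
# information gain. Not the paper's exact EWDT construction.
import numpy as np

def svd_feature_importance(X):
    """Importance of each column from singular values/vectors of X."""
    _, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    imp = (s[:, None] * np.abs(Vt)).sum(axis=0)  # loadings scaled by s
    return imp / imp.sum()

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def weighted_gain(X, y, j, threshold, importance):
    """Information gain of splitting feature j, scaled by its importance."""
    left = X[:, j] <= threshold
    n, nl = len(y), left.sum()
    if nl in (0, n):
        return 0.0
    gain = entropy(y) - (nl / n) * entropy(y[left]) \
                      - ((n - nl) / n) * entropy(y[~left])
    return importance[j] * gain

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.integers(0, 2, size=200)
imp = svd_feature_importance(X)
print(weighted_gain(X, y, 0, X[:, 0].mean(), imp))
```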
{"title":"A decision tree algorithm based on adaptive entropy of feature value importance","authors":"Shaobo Deng, Weili Yuan, Sujie Guan, Xing Lin, Zemin Liao, Min Li","doi":"10.1016/j.bdr.2025.100530","DOIUrl":"10.1016/j.bdr.2025.100530","url":null,"abstract":"<div><div>Constructing an optimal decision tree remains a challenging task. Existing algorithms often utilize power coefficient methods or standardization techniques to weight the entropy value; however, these approaches do not sufficiently account for the importance of attributes. This paper introduces an Adaptive Entropy Decision Tree (EWDT) algorithm, which leverages eigenvalue importance and integrates singular value decomposition into the calculation of entropy values. Experimental results demonstrate that the proposed algorithm outperforms other decision tree algorithms in terms of accuracy, precision, recall, and F1-score.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100530"},"PeriodicalIF":3.5,"publicationDate":"2025-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143899918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TE-PADN: A poisoning attack defense model based on temporal margin samples
Haitao He, Ke Liu, Lei Zhang, Ke Xu, Jiazheng Li, Jiadong Ren
Pub Date: 2025-04-09. DOI: 10.1016/j.bdr.2025.100528. Big Data Research, Vol. 40, Article 100528.
With the development of network security research, intrusion detection systems based on deep learning have shown great potential for network attack detection. Yet as crucial tools for ensuring network information security, these systems are themselves vulnerable to poisoning attacks. Most current poisoning defense methods cannot effectively exploit network traffic characteristics and work only for specific models, defending other models poorly. Furthermore, the detection of poisoning attacks is often overlooked, leaving no timely and effective defense against them. We therefore propose a data poisoning defense mechanism called TE-PADN. First, we introduce a temporal margin sample generation algorithm that integrates an attention mechanism: after mapping the original time series into a latent feature space, the algorithm learns the temporal characteristics of the data and attends to information from different positions to generate temporal margin samples for repairing poisoned models. Second, we propose a multi-level poisoning attack detection method for accurate, real-time detection of previously undetected poisoning attacks. By employing ensemble learning, this approach enhances model robustness, repairs classification boundaries that have shifted due to poisoning, and achieves an efficient defense. Finally, experimental validation shows promising results: under a 10% attack intensity, the average accuracy of TE-PADN in recovering poisoned models increased by 6.5% on the NSL-KDD dataset, 5.3% on the UNSW-NB15 dataset, and 5.9% on the CICIDS2017 dataset.
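As a generic illustration of ensemble-based poisoning detection (a stand-in for the idea, not the paper's TE-PADN algorithm), one can train models on bootstrap subsets and flag samples on which the ensemble strongly disagrees:

```python
# Generic ensemble-disagreement detector; not the authors' method.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def disagreement_scores(X_train, y_train, X_check, n_models=5, seed=0):
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        clf.fit(X_train[idx], y_train[idx])
        preds.append(clf.predict(X_check))
    preds = np.array(preds)                       # (n_models, n_samples)
    # Fraction of models disagreeing with the per-sample majority vote.
    majority = np.apply_along_axis(
        lambda c: np.bincount(c).argmax(), 0, preds)
    return (preds != majority).mean(axis=0)

rng = np.random.default_rng(1)
X, y = rng.normal(size=(300, 10)), rng.integers(0, 2, size=300)
scores = disagreement_scores(X, y, X)
print("suspected samples:", (scores >= 0.4).sum())
```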
{"title":"TE-PADN: A poisoning attack defense model based on temporal margin samples","authors":"Haitao He , Ke Liu , Lei Zhang , Ke Xu , Jiazheng Li , Jiadong Ren","doi":"10.1016/j.bdr.2025.100528","DOIUrl":"10.1016/j.bdr.2025.100528","url":null,"abstract":"<div><div>With the development of network security research, intrusion detection systems based on deep learning show great potential in network attack detection. As crucial tools for ensuring network information security, these systems themselves are vulnerable to poisoning attacks from attackers. Currently, most poisoning attack defense methods cannot effectively utilize network traffic characteristics and are only effective for specific models, showing poor defense results for other models. Furthermore, detection of poisoning attacks is often overlooked, leading to a lack of timely and effective defense against such attacks. Therefore, we propose a data poisoning defense mechanism called TE-PADN. Firstly, we introduce a temporal margin sample generation algorithm that integrates an attention mechanism. Based on mapping the original data time series into a latent feature space, this algorithm learns the temporal characteristics of the data and focuses on information from different positions using the attention mechanism to generate temporal margin samples for repairing poisoned models. Secondly, we propose a multi-level poisoning attack detection method for real-time and accurate detection of undetected poisoning attacks. By employing ensemble learning methods, this approach enhances model robustness, repairs model classification boundaries that have shifted due to poisoning attacks and achieves efficient defense against poisoning attacks. Finally, experimental validation of our proposed method demonstrates promising results. Under a 10% attack intensity, the average accuracy of TE-PADN in recovering poisoning models increased by 6.5% on the NSL-KDD dataset, 5.3% on the UNSW-NB15 dataset, and 5.9% on the CICIDS2017 dataset.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100528"},"PeriodicalIF":3.5,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143816452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging artificial intelligence for pandemic management: Case of COVID-19 in the United States
Ehsan Ahmadi, Reza Maihami
Pub Date: 2025-04-08. DOI: 10.1016/j.bdr.2025.100529. Big Data Research, Vol. 40, Article 100529.
The COVID-19 pandemic revealed significant limitations in traditional approaches to analyzing time-series data that use one-dimensional data such as historical infection rates. Such approaches do not capture the complex, multifactor influences on disease spread. This paper addresses these challenges by proposing a comprehensive methodology that integrates multiple data sources, including community mobility, census information, Google search trends, socioeconomic variables, vaccination coverage, and political data. In addition, this paper proposes a new cross-learning (CL) methodology that allows for the training of machine learning models on multiple related time series simultaneously, enabling more accurate and robust predictions. Applying the CL approach with four machine learning algorithms, we successfully forecasted confirmed COVID-19 cases 30 days in advance with greater accuracy than the traditional ARIMAX model and the newer Transformer deep learning technique. Our findings identified daily hospital admissions as a significant predictor at the state level and vaccination status at the national level. Random Forest with CL was very effective, performing best in 44 states, while ARIMAX outperformed in seven larger states. These findings highlight the importance of advanced predictive modeling in resource optimization and response strategy development for future health emergencies.
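A minimal sketch of the cross-learning idea is shown below: pool lagged features from many related series (e.g., states) and fit one global model. The feature construction and synthetic data are illustrative, not the paper's exact pipeline:

```python
# Cross-learning sketch: one global model over many related time series.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_series, T, horizon, n_lags = 10, 200, 30, 14
series = rng.poisson(50, size=(n_series, T)).astype(float)

X, y = [], []
for s in range(n_series):                 # pool all series into one table
    for t in range(n_lags, T - horizon):
        lags = series[s, t - n_lags:t]    # recent history as features
        X.append(np.concatenate(([s], lags)))   # series id + lag features
        y.append(series[s, t + horizon])  # target 30 steps ahead
X, y = np.array(X), np.array(y)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)                           # one model, many related series
print(model.predict(X[:3]))
```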
{"title":"Leveraging artificial intelligence for pandemic management: Case of COVID-19 in the United States","authors":"Ehsan Ahmadi, Reza Maihami","doi":"10.1016/j.bdr.2025.100529","DOIUrl":"10.1016/j.bdr.2025.100529","url":null,"abstract":"<div><div>The COVID-19 pandemic revealed significant limitations in traditional approaches to analyzing time-series data that use one-dimensional data such as historical infection rates. Such approaches do not capture the complex, multifactor influences on disease spread. This paper addresses these challenges by proposing a comprehensive methodology that integrates multiple data sources, including community mobility, census information, Google search trends, socioeconomic variables, vaccination coverage, and political data. In addition, this paper proposes a new cross-learning (CL) methodology that allows for the training of machine learning models on multiple related time series simultaneously, enabling more accurate and robust predictions. Applying the CL approach with four machine learning algorithms, we successfully forecasted confirmed COVID-19 cases 30 days in advance with greater accuracy than the traditional ARIMAX model and the newer Transformer deep learning technique. Our findings identified daily hospital admissions as a significant predictor at the state level and vaccination status at the national level. Random Forest with CL was very effective, performing best in 44 states, while ARIMAX outperformed in seven larger states. These findings highlight the importance of advanced predictive modeling in resource optimization and response strategy development for future health emergencies.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"40 ","pages":"Article 100529"},"PeriodicalIF":3.5,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143839334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}