Pub Date: 2026-01-22 | DOI: 10.1016/j.bdr.2026.100591
Michael Franklin Mbouopda, Emille E.O. Ishida, Engelbert Mephu Nguifo, Emmanuel Gangler
Exploring the expansion history of the universe, understanding its evolutionary stages, and predicting its future evolution are important goals in astrophysics. Today, machine learning tools help achieve these goals by analyzing transient sources, which are modeled as uncertain time series. Although black-box methods achieve appreciable performance, existing interpretable time series methods fail to reach acceptable performance on this type of data. Furthermore, data uncertainty is rarely taken into account in these methods. In this work, we propose an uncertainty-aware subsequence-based model whose classification performance is comparable to that of state-of-the-art methods. Unlike conformal learning, which estimates model uncertainty on predictions, our method takes data uncertainty as additional input. Moreover, our approach is explainable-by-design, giving domain experts the ability to inspect the model and explain its predictions. The explainability of the proposed method also has the potential to inspire new developments in theoretical astrophysics modeling by suggesting important subsequences that capture details of light-curve shapes. The dataset, the source code of our experiments, and the results are available in a public repository.
Title: Explainable classification of astronomical uncertain time series. Big Data Research, vol. 43, Article 100591.
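As an illustration of the subsequence-based idea sketched in the abstract above, here is a minimal, hypothetical uncertainty-weighted subsequence distance: each squared difference is scaled by the combined variance of the two uncertain points, so noisier observations count less. The function name and the weighting scheme are our own sketch, not the authors' exact formulation.

```python
import numpy as np

def uncertainty_aware_distance(series, sigma, shapelet, shapelet_sigma):
    """Minimum uncertainty-weighted distance between a candidate subsequence
    (shapelet) and all sliding windows of an uncertain time series.

    Illustrative sketch: each squared difference is divided by the combined
    variance of the two uncertain points, so noisy observations weigh less.
    """
    n, m = len(series), len(shapelet)
    best = np.inf
    for start in range(n - m + 1):
        window = series[start:start + m]
        var = sigma[start:start + m] ** 2 + shapelet_sigma ** 2
        d = np.mean((window - shapelet) ** 2 / var)
        best = min(best, d)
    return best
```

A shapelet that exactly matches some window of the series yields distance zero regardless of the stated uncertainties, while mismatches in high-uncertainty regions are penalized less than mismatches in well-measured regions.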
Pub Date: 2026-01-21 | DOI: 10.1016/j.bdr.2026.100590
Shahab Ghodsi, Ali Moeini
Users’ opinions are among the most helpful decision-making criteria in online shopping. The importance of online reviews drives malicious users to write fake reviews that try to deceive future customers. Thus, there is a need to distinguish bogus reviews from genuine ones and fraudulent users from honest ones. The large volume of opinions, however, makes the problem more challenging. Most studies focus on detecting opinion fraud when the data size is small (i.e., fewer than 6 million reviews), so their approaches do not scale well to massive datasets in terms of execution time and effectiveness. To meet this challenge, we propose a model with the following characteristics: (1) it runs on the review network, and is thus a general model; (2) it uses an adapted version of the loopy belief propagation (LBP) algorithm to detect fraudulent nodes; (3) it uses Dempster–Shafer theory (evidence theory) to discover fake reviews; (4) it is implemented in Spark, making it capable of handling large datasets properly. Our experiments on the Amazon review dataset showed that the model is fast (it returns results on tens of millions of reviews in a few minutes) and effective (it successfully detects fraudulent nodes and fake reviews).
Title: Opinion fraud detection on massive datasets by Spark. Big Data Research, vol. 43, Article 100590.
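The abstract above combines evidence via Dempster–Shafer theory; the sketch below shows Dempster's rule of combination for a two-hypothesis frame {fake, genuine}, with an "either" mass for undecided belief. The dictionary-based representation is our illustration, not the paper's data model.

```python
def combine_dempster(m1, m2):
    """Dempster's rule of combination over the frame {'fake', 'genuine'}.

    Mass functions are dicts over 'fake', 'genuine', and 'either' (the whole
    frame). Conflicting mass (fake vs. genuine) is discarded and the rest
    renormalized by 1 - conflict.
    """
    sets = {'fake': {'fake'}, 'genuine': {'genuine'},
            'either': {'fake', 'genuine'}}
    combined = {'fake': 0.0, 'genuine': 0.0, 'either': 0.0}
    conflict = 0.0
    for a, sa in sets.items():
        for b, sb in sets.items():
            inter = sa & sb
            mass = m1[a] * m2[b]
            if not inter:
                conflict += mass                 # contradictory evidence
            elif inter == {'fake', 'genuine'}:
                combined['either'] += mass
            elif inter == {'fake'}:
                combined['fake'] += mass
            else:
                combined['genuine'] += mass
    k = 1.0 - conflict
    return {h: v / k for h, v in combined.items()}
```

Combining two sources that both lean toward "fake" concentrates mass on that hypothesis while shrinking the residual "either" belief.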
Pub Date: 2026-01-20 | DOI: 10.1016/j.bdr.2026.100589
Xingyu Li, Jinglei Liu
By grouping highly correlated data together, least squares regression (LSR) is widely applied in data analysis and clustering tasks. However, most traditional regression methods involve a significant amount of computation and rely on linear assumptions, which makes them difficult to scale to large data. To address these challenges, we revisit the classical spectral clustering method, least squares regression, and propose large-scale least squares regression based on fast spectral embedding (FSELSR). First, by creating a bipartite graph between data points and anchor points, FSELSR provides a low-dimensional representation of image data. This not only reduces the scale and computational complexity of processing large-scale data, but also helps to reveal the intrinsic structure of the data more clearly. Second, we introduce random Fourier feature mapping (RFFM) into FSELSR, extending it to fast spectral embedding kernel LSR (FSEKLSR), which improves efficiency and clustering quality on complex nonlinear data. Finally, we provide globally optimal closed-form solutions for both models, making them easier to implement, train, and apply in practice. Extensive experiments were conducted on both real and synthetic datasets, and the results demonstrate the effectiveness and efficiency of the proposed method.
Title: Large-scale least squares regression based on fast spectral embedding and random Fourier feature mapping. Big Data Research, vol. 43, Article 100589.
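Random Fourier feature mapping, mentioned above, approximates the RBF kernel with an explicit finite-dimensional map (in the Rahimi–Recht style), so kernel methods reduce to linear algebra on the mapped features. A minimal sketch, not the paper's implementation:

```python
import numpy as np

def random_fourier_features(X, n_features, gamma, rng):
    """Random Fourier feature map z(x) with z(x)·z(y) ≈ exp(-gamma ||x-y||^2),
    i.e., an approximation of the RBF kernel.

    Frequencies are drawn from N(0, 2*gamma) per dimension, phases uniformly
    from [0, 2π); the cosine features are scaled so inner products are
    unbiased kernel estimates.
    """
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```

With enough random features the inner product of two mapped points converges to the exact RBF kernel value, which is what lets a linear solver stand in for the kernelized one.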
Pub Date: 2026-01-05 | DOI: 10.1016/j.bdr.2025.100587
Yiding Liu, Deren Xu, Yanyong Wang
In the current era of soaring data volumes, graph computing has become crucial for processing large-scale data across various domains. However, it faces performance challenges on modern architectures like CPUs and GPUs due to irregular memory access patterns, leading to suboptimal memory bandwidth utilization.
To address this, Processing-in-Memory (PIM) technology, exemplified by the Hybrid Memory Cube (HMC), has been proposed. While HMC offers higher internal bandwidth, solely relying on it introduces numerous remote memory accesses. Additionally, neglecting host cores with caches undermines potential benefits from low-latency cache access.
In our paper, we propose VertexLocater, a dynamic offloading technique that strategically allocates processing between host cores and HMC, leveraging temporal locality with caches. This reduces remote memory accesses and enhances performance. Evaluation shows up to a 2.34x speedup and a 47% reduction in energy consumption. Further, our advanced design Multi-Level-Enabled VertexLocater (MLE-VL) optimizes multi-level overlapping of GPC and graph structure analysis, improving performance by an additional 9.7% and reducing uncore energy by 4.8%.
Title: VertexLocater: PIM-enabled dynamic offloading for graph computing. Big Data Research, vol. 43, Article 100587.
Pub Date: 2026-01-03 | DOI: 10.1016/j.bdr.2025.100588
Boyu Guan, Xiaodong Gu
Time series forecasting plays a crucial role in diverse real-world applications, yet existing approaches often struggle to balance predictive accuracy with computational efficiency. In this paper, we study GATTSF, a forecasting framework based on PatchTST in which the conventional multi-head attention plus feed-forward (MHA+FFN) components are replaced with gated attention units (GAUs) and combined with an efficient patching strategy, aiming to balance accuracy and computational efficiency. Extensive experiments on multiple benchmark datasets show that GATTSF achieves competitive forecasting accuracy while reducing model complexity compared with strong baselines. This favorable trade-off between efficiency and effectiveness highlights the practicality of GATTSF for long-term forecasting. We also observe that on datasets with weak periodicity (e.g., Exchange) or extremely long horizons, the performance gap to some baselines narrows, suggesting opportunities for future improvement through hierarchical or hybrid architectures.
Title: Efficient time series forecasting with gated attention and patched data: A transformer-based approach. Big Data Research, vol. 43, Article 100588.
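The patching strategy mentioned above tokenizes a series into (possibly overlapping) patches, shrinking the sequence length the attention layers see from n points to roughly n/stride tokens. A minimal sketch of such a patch split; the parameter values are illustrative, not GATTSF's configuration:

```python
import numpy as np

def make_patches(series, patch_len, stride):
    """Split a 1-D series into overlapping patches (PatchTST-style
    tokenization). Returns an array of shape (n_patches, patch_len)."""
    n = len(series)
    starts = range(0, n - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len] for s in starts])
```

Because attention cost is quadratic in token count, halving the number of tokens via stride roughly quarters the attention FLOPs, which is where the efficiency gain of patching comes from.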
Pub Date: 2025-12-15 | DOI: 10.1016/j.bdr.2025.100586
Chengmao Wu, Jun Hou
Entropy regularization for semi-supervised fuzzy clustering enhances accuracy while maintaining fuzzy clustering flexibility, but its limited performance restricts broader application. This paper analyzes existing entropy-based semi-supervised fuzzy algorithms, noting that when a labeled sample's membership degree matches its prior value, the entropy function weakens the prior information's influence, resulting in minimal performance gains. To address this, we propose a novel asymmetric deviation entropy for fuzzy C-means clustering with partial supervision, leading to a new semi-supervised fuzzy clustering algorithm. We prove its convergence using the Zangwill and bordered Hessian theorems, providing a solid theoretical foundation. To improve the slow convergence of semi-supervised fuzzy clustering algorithms, we use the triangle inequality to identify non-affinity clustering centers. This reduces the membership degree of samples linked to these centers and increases that of samples associated with affinity centers, leading to a faster algorithm. Experimental results demonstrate that our algorithm surpasses existing methods in accuracy, stability, and efficiency, contributing to the advancement of semi-supervised fuzzy clustering.
Title: Asymmetric deviation entropy regularization for semi-supervised fuzzy C-means clustering and its fast Algorithm. Big Data Research, vol. 43, Article 100586.
The development and evaluation of autonomous maritime vessels rely heavily on data-driven insights from iterative testing and analysis. While initial analyses are often conducted on small experimental datasets to explore key system characteristics, scaling these analyses to large datasets presents significant challenges. In this study, we extend our prior work on visual exploration of small-scale test bed data by proposing approaches to scaling the visual analytics techniques to large datasets. Using AIS data from ferry boats as a proxy for extensive maritime drone operations, we address the challenges of large-scale data exploration over eight days of repetitive ferry movements across a busy strait, simulating conditions suitable for autonomous vessels. Our approach investigates movement patterns, operational stability during repeated trips, and potential collision scenarios. To support such analyses, we propose a general, reusable workflow and a set of practical guidelines for applying visual analytics techniques to large maritime movement datasets. The findings highlight the scalability and adaptability of visual analytics methods, providing valuable tools for analyzing complex maritime datasets and advancing autonomous vessel technologies.
Title: Techniques for interactive visual examination of vessel performance. Natalia Andrienko, Gennady Andrienko, Dimitris Zissis, Alexandros Troupiotis-Kapeliaris, Giannis Spiliopoulos. Big Data Research, vol. 43, Article 100575. Pub Date: 2025-12-01 | DOI: 10.1016/j.bdr.2025.100575.
Pub Date: 2025-11-27 | DOI: 10.1016/j.bdr.2025.100574
Chiara Rucco, Antonella Longo, Motaz Saad
Data ingestion plays a crucial role in enterprise data management, particularly when dealing with large-scale datasets. This paper introduces the Metadata-driven INgestion Design pattern (MIND), a flexible, metadata-driven design pattern for cloud-based big data management. MIND enhances adaptability by enabling dynamic adjustments to ingestion types, schema updates, table additions, and the integration of new data sources.
Validated on the Azure cloud platform, MIND demonstrates scalability, feasibility, and efficiency in reducing ingestion time and operational complexity. By relying on metadata for pipeline orchestration, MIND offers a robust solution to the challenges of high-volume data processing, providing a more agile and maintainable approach to data workflows. This work contributes to the evolution of metadata-driven architectures and offers a foundation for future advancements in data management technologies.
Title: MIND: A metadata-driven INgestion design pattern for efficient data ingestion. Big Data Research, vol. 43, Article 100574.
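A minimal sketch of the metadata-driven idea described above: each catalog entry declares a table's source and ingestion mode, and a dispatcher runs the matching loader, so adding a table or switching full-to-incremental is a metadata change rather than a code change. Field names and loader signatures here are hypothetical, not MIND's actual schema:

```python
def run_ingestion(metadata, load_full, load_incremental):
    """Dispatch ingestion jobs from a metadata catalog (a sketch of the
    metadata-driven pattern; field names are illustrative).

    Each catalog entry declares its target table, source, and ingestion
    mode; incremental entries also carry a watermark for delta loads.
    """
    results = {}
    for entry in metadata:
        if entry["mode"] == "full":
            results[entry["table"]] = load_full(entry["source"])
        elif entry["mode"] == "incremental":
            results[entry["table"]] = load_incremental(
                entry["source"], entry["watermark"])
        else:
            raise ValueError(f"unknown ingestion mode: {entry['mode']}")
    return results
```

With this shape, schema updates and new sources are handled by editing the catalog that feeds `metadata`, while the orchestration code stays untouched.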
Pub Date: 2025-11-25 | DOI: 10.1016/j.bdr.2025.100577
Norman Bereczki, Vilmos Simon
The substantial growth in the number of commercial vehicles has put great stress on road infrastructure, leading to traffic congestion, which has become one of the most pressing problems in modern cities. The latest trends in telecommunication, sensorization, and machine learning enable engineers to design Cooperative Intelligent Transportation Systems (C-ITS) that enhance road safety and traffic efficiency by gathering and processing data. A frequently studied C-ITS application is congestion prediction. In this paper, we introduce VEEPS (V2X-Based Traffic Congestion Prediction System), a novel congestion forecasting system that relies solely on local vehicle measurements and on information exchange between vehicles and infrastructure. The system uses a new, locally operating metric, the following distance ratio (FDR), to predict the future state of traffic on a road section based on spatio-temporal FDR features. It overcomes the primary limitations of most existing methods, namely the requirement for a central intelligent agent and the deployment of a myriad of traffic measurement sensors, which make them infeasible and uneconomical for extensive city networks. The experimental setup shows that VEEPS can outperform existing statistical, machine learning, and deep learning-based systems in terms of accuracy, with a lower cost of deployment and maintenance.
Title: Novel V2X-based traffic congestion prediction system. Big Data Research, vol. 43, Article 100577.
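The abstract above does not spell out the FDR formula; one plausible reading, purely as an illustration (the constants and the definition are our assumptions, not the paper's), is the observed inter-vehicle gap divided by the gap a driver would keep in free-flowing traffic:

```python
def following_distance_ratio(gap_m, speed_mps,
                             desired_headway_s=2.0, min_gap_m=2.0):
    """Hypothetical following-distance-ratio style metric: observed gap to
    the leading vehicle divided by a speed-dependent free-flow gap
    (standstill margin plus a fixed time headway).

    Ratios well below 1 suggest densifying, congestion-prone traffic;
    ratios near or above 1 suggest free flow.
    """
    desired_gap = min_gap_m + desired_headway_s * speed_mps
    return gap_m / desired_gap
```

A per-vehicle quantity like this needs only the vehicle's own sensors, which matches the abstract's claim that the metric operates locally and avoids central agents and roadside sensor deployments.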
Nowadays, a new concept of the smart city, driven by technological advancements, has emerged with a significant impact on various domains, including mobility. Among these technologies, the digital twin has recently gained attention; however, its impact on smart mobility, particularly in correlation with big data, remains underexplored. Based on these considerations, this paper aims to investigate the role of digital twin technology in conjunction with big data in the context of smart mobility.
A case study approach has been adopted to analyze the Italian context. The results highlight the ecosystem elements and identify the primary drivers of sustainable growth in integrating digital twin and big data technologies within smart mobility. Two iterative loops have been identified: one connects technology service providers with mobility stakeholders, illustrating how artificial intelligence-driven, user-centric mobility solutions are co-created through perceptive and responsive mechanisms; the other links mobility stakeholders with end-users, enhancing operational efficiency through user-generated knowledge and ultimately improving mobility experiences and urban transportation systems.
This study contributes to the literature by providing a structured analysis of digital twin applications in smart mobility, emphasizing the interplay between big data and ecosystem dynamics. The findings offer theoretical and practical implications, highlighting opportunities for policymakers, technology providers, and mobility operators to foster sustainable and data-driven urban mobility solutions. Finally, directions for future research are discussed, outlining potential advancements in digital twin integration for smart mobility ecosystems.
Title: Unleashing the power of digital twin and big data as a new frontier for smart mobility: An ecosystem perspective. Francesca Loia, Claudia Perillo, Ginevra Gravili. Big Data Research, vol. 43, Article 100576. Pub Date: 2025-11-24 | DOI: 10.1016/j.bdr.2025.100576.