
Big Data Research: Latest Articles

Explainable classification of astronomical uncertain time series
IF 4.2 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-22 | DOI: 10.1016/j.bdr.2026.100591
Michael Franklin Mbouopda, Emille E.O. Ishida, Engelbert Mephu Nguifo, Emmanuel Gangler
Exploring the expansion history of the universe, understanding its evolutionary stages, and predicting its future evolution are important goals in astrophysics. Today, machine learning tools are used to help achieve these goals by analyzing transient sources, which are modeled as uncertain time series. Although black-box methods achieve appreciable performance, existing interpretable time series methods have failed to obtain acceptable performance for this type of data. Furthermore, data uncertainty is rarely taken into account in these methods. In this work, we propose an uncertainty-aware, subsequence-based model that achieves classification performance comparable to that of state-of-the-art methods. Unlike conformal learning, which estimates model uncertainty on predictions, our method takes data uncertainty as additional input. Moreover, our approach is explainable-by-design, giving domain experts the ability to inspect the model and explain its predictions. The explainability of the proposed method also has the potential to inspire new developments in theoretical astrophysics modeling by suggesting important subsequences that depict details of light curve shapes. The dataset, the source code of our experiment, and the results are made available in a public repository.
Big Data Research, vol. 43, Article 100591 (Journal Article)
Citations: 0
Opinion fraud detection on massive datasets by Spark
IF 4.2 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-21 | DOI: 10.1016/j.bdr.2026.100590
Shahab Ghodsi, Ali Moeini
Users’ opinions are one of the most helpful decision-making criteria in online shopping. The importance of online reviews leads malicious users to write fake reviews in an attempt to deceive future customers. Thus, there is a need to distinguish bogus reviews from genuine ones and fraudulent users from honest ones. The large volume of opinions, however, makes the problem more challenging. Most studies focus on detecting opinion fraud when the data size is small (i.e., fewer than 6 million reviews), so their approaches do not fit massive datasets well in terms of execution time and effectiveness. To meet this challenge, we propose a model with the following characteristics: (1) it runs on the review network, and thus is a general model; (2) it utilises an adapted version of the loopy belief propagation (LBP) algorithm to detect fraudulent nodes; (3) it uses Dempster-Shafer theory (evidence theory) to discover fake reviews; (4) it is implemented in Spark, making it capable of handling large datasets properly. Our experiments on the Amazon review dataset showed that the model is fast (it returns results on tens of millions of reviews in a few minutes) and effective (it successfully detects fraudulent nodes and fake reviews).
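The abstract mentions Dempster-Shafer theory for discovering fake reviews but does not say how evidence is fused. As an illustration only, the following is a minimal sketch of Dempster's rule of combination over a two-element frame {fake, genuine}. The two evidence sources (`m_text`, `m_rating`) and their mass values are hypothetical and are not taken from the paper.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset (a subset of the frame of
    discernment) to a mass; the masses of each function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Mass assigned to contradictory hypotheses
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalise by the non-conflicting mass
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


FAKE = frozenset({"fake"})
GENUINE = frozenset({"genuine"})
EITHER = FAKE | GENUINE  # ignorance: mass not committed to either label

# Hypothetical evidence from two independent signals about one review
m_text = {FAKE: 0.6, EITHER: 0.4}
m_rating = {FAKE: 0.5, GENUINE: 0.2, EITHER: 0.3}

fused = combine(m_text, m_rating)
for hypothesis, mass in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(sorted(hypothesis), round(mass, 4))
```

Keeping an explicit `EITHER` set lets a weak signal express ignorance instead of being forced to split its mass between the two labels, which is the main practical difference from a plain Bayesian update.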
Big Data Research, vol. 43, Article 100590 (Journal Article)
Citations: 0
Large-scale least squares regression based on fast spectral embedding and random Fourier feature mapping
IF 4.2 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-20 | DOI: 10.1016/j.bdr.2026.100589
Xingyu Li, Jinglei Liu
By grouping highly correlated data together, least squares regression (LSR) is widely applied in data analysis and clustering tasks. However, most traditional regression methods involve a significant amount of computation and are based on linear assumptions, which makes them difficult to scale to large datasets. To address these challenges, we revisit the classical spectral clustering method, least squares regression, and propose large-scale least squares regression based on fast spectral embedding (FSELSR). First, by creating a bipartite graph between data points and anchor points, the FSELSR method provides a low-dimensional representation of image data. This not only reduces the scale and computational complexity of processing large-scale data, but also helps to reveal the intrinsic structure of the data more clearly. Second, we introduce random Fourier feature mapping (RFFM) into FSELSR, extending it to fast spectral embedding kernel LSR (FSEKLSR) and thereby improving its efficiency and clustering quality on complex nonlinear data. Finally, we provide globally optimal closed-form solutions for both models, making them easier to implement, train, and apply in practice. Extensive experiments were conducted on both real and synthetic datasets, and the results demonstrated the effectiveness and efficiency of the proposed method.
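The abstract does not spell out the random Fourier feature mapping it uses. The sketch below shows the standard construction for a Gaussian (RBF) kernel, which may differ from the paper's exact variant: frequencies are drawn from a normal distribution and phases uniformly, so that the inner product of two feature vectors approximates the kernel value. The function names, the bandwidth `sigma`, and the sample points are assumptions made for illustration.

```python
import math
import random


def make_rff(dim, n_features, sigma, seed=0):
    """Return a feature map z(x) with z(x) . z(y) ~= exp(-||x - y||^2 / (2 sigma^2))."""
    rng = random.Random(seed)
    # Frequencies sampled from the kernel's spectral density N(0, sigma^-2 I)
    weights = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
               for _ in range(n_features)]
    # Random phases in [0, 2*pi)
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return feature_map


def rbf_kernel(x, y, sigma):
    """Exact Gaussian kernel, used here only to check the approximation."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


phi = make_rff(dim=2, n_features=4000, sigma=1.0)
x, y = [0.3, -0.2], [0.1, 0.4]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = rbf_kernel(x, y, 1.0)
print(f"approx={approx:.3f} exact={exact:.3f}")
```

Because the mapped features are explicit vectors, a linear least squares model fitted on them behaves like a kernel model while its cost grows with the number of features rather than with the square of the number of samples.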
Big Data Research, vol. 43, Article 100589 (Journal Article)
Citations: 0
VertexLocater: PIM-enabled dynamic offloading for graph computing
IF 4.2 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-05 | DOI: 10.1016/j.bdr.2025.100587
Yiding Liu, Deren Xu, Yanyong Wang
In the current era of soaring data volumes, graph computing has become crucial for processing large-scale data across various domains. However, it faces performance challenges on modern architectures like CPUs and GPUs due to irregular memory access patterns, leading to suboptimal memory bandwidth utilization.
To address this, Processing-in-Memory (PIM) technology, exemplified by the Hybrid Memory Cube (HMC), has been proposed. While HMC offers higher internal bandwidth, solely relying on it introduces numerous remote memory accesses. Additionally, neglecting host cores with caches undermines potential benefits from low-latency cache access.
In our paper, we propose VertexLocater, a dynamic offloading technique that strategically allocates processing between host cores and HMC, leveraging temporal locality with caches. This reduces remote memory accesses and enhances performance. Evaluation shows up to a 2.34x speedup and a 47% reduction in energy consumption. Further, our advanced design, Multi-Level-Enabled VertexLocater (MLE-VL), optimizes multi-level overlapping of GPC and graph structure analysis, improving performance by an additional 9.7% and reducing uncore energy by 4.8%.
Big Data Research, vol. 43, Article 100587 (Journal Article)
Citations: 0
Efficient time series forecasting with gated attention and patched data: A transformer-based approach
IF 4.2 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-03 | DOI: 10.1016/j.bdr.2025.100588
Boyu Guan, Xiaodong Gu
Time series forecasting plays a crucial role in diverse real-world applications, yet existing approaches often struggle to balance predictive accuracy with computational efficiency. In this paper, we study GATTSF, a forecasting framework based on PatchTST in which the conventional MHA+FFN components are replaced with gated attention units (GAUs) and combined with an efficient patching strategy, aiming to balance accuracy and computational efficiency. Extensive experiments on multiple benchmark datasets show that GATTSF achieves competitive forecasting accuracy while reducing model complexity compared with strong baselines. This favorable trade-off between efficiency and effectiveness highlights the practicality of GATTSF for long-term forecasting. We also observe that on datasets with weak periodicity (e.g., Exchange) or extremely long horizons, the performance gap to some baselines narrows, suggesting opportunities for future improvement through hierarchical or hybrid architectures.
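The patching step mentioned above can be illustrated with a short sketch. It assumes the common scheme in which a series is cut into fixed-length, possibly overlapping windows that then serve as input tokens; the function name, `patch_len`, and `stride` are illustrative, and the end-of-series padding used by some patch-based models is deliberately left out.

```python
def make_patches(series, patch_len, stride):
    """Split a 1-D series into windows of length patch_len taken every stride steps.

    Trailing values that do not fill a complete patch are dropped
    (a simplification; no padding is applied).
    """
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    return [series[start:start + patch_len]
            for start in range(0, last_start + 1, stride)]


series = list(range(10))
patches = make_patches(series, patch_len=4, stride=2)
for p in patches:
    print(p)
```

With a stride larger than one, the number of tokens the attention layers must process is roughly the series length divided by the stride, which is where much of the efficiency gain of patch-based forecasting comes from.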
Big Data Research, vol. 43, Article 100588 (Journal Article)
Citations: 0
Asymmetric deviation entropy regularization for semi-supervised fuzzy C-means clustering and its fast Algorithm
IF 4.2 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-15 | DOI: 10.1016/j.bdr.2025.100586
Chengmao Wu, Jun Hou
Entropy regularization for semi-supervised fuzzy clustering enhances accuracy while maintaining fuzzy clustering flexibility, but its limited performance restricts broader application. This paper analyzes existing entropy-based semi-supervised fuzzy algorithms, noting that when a labeled sample's membership degree matches its prior value, the entropy function weakens the prior information's influence, resulting in minimal performance gains. To address this, we propose a novel asymmetric deviation entropy for fuzzy C-means clustering with partial supervision, leading to a new semi-supervised fuzzy clustering algorithm. We prove its convergence using the Zangwill and bordered Hessian theorems, providing a solid theoretical foundation. To improve the slow convergence of semi-supervised fuzzy clustering algorithms, we use the triangle inequality to identify non-affinity clustering centers. This reduces the membership degree of samples linked to these centers and increases that of samples associated with affinity centers, leading to a faster algorithm. Experimental results demonstrate that our algorithm surpasses existing methods in accuracy, stability, and efficiency, contributing to the advancement of semi-supervised fuzzy clustering.
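The abstract does not state the exact criterion used to identify non-affinity centers. One well-known way the triangle inequality is used to skip distance computations in centroid-based clustering is the bound sketched below: if a center c_j lies at least twice as far from the sample's nearest center c_i as the sample does, then c_j cannot be closer to the sample than c_i. This is an assumption about the general idea, not the paper's algorithm, and the function name and data are hypothetical.

```python
import math


def distant_centers(x, centers, nearest_idx):
    """Indices of centers that the triangle inequality rules out as closer to x.

    If d(c_i, c_j) >= 2 * d(x, c_i), then
    d(x, c_j) >= d(c_i, c_j) - d(x, c_i) >= d(x, c_i),
    so d(x, c_j) never needs to be computed to know c_j is not nearer.
    """
    d_near = math.dist(x, centers[nearest_idx])
    return [j for j, c in enumerate(centers)
            if j != nearest_idx
            and math.dist(centers[nearest_idx], c) >= 2.0 * d_near]


centers = [(1.0, 0.0), (10.0, 0.0), (2.0, 0.0), (1.2, 0.3)]
x = (0.5, 0.0)
skipped = distant_centers(x, centers, nearest_idx=0)
print(skipped)
```

Center-to-center distances are computed once per iteration and shared by all samples, so the check costs far less than evaluating every sample-to-center distance.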
Big Data Research, vol. 43, Article 100586 (Journal Article)
Citations: 0
Techniques for interactive visual examination of vessel performance
IF 4.2 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-01 | DOI: 10.1016/j.bdr.2025.100575
Natalia Andrienko, Gennady Andrienko, Dimitris Zissis, Alexandros Troupiotis-Kapeliaris, Giannis Spiliopoulos
The development and evaluation of autonomous maritime vessels rely heavily on data-driven insights from iterative testing and analysis. While initial analyses are often conducted on small experimental datasets to explore key system characteristics, scaling these analyses to large datasets presents significant challenges. In this study, we extend our prior work on visual exploration of small-scale test bed data by proposing approaches to scaling the visual analytics techniques to large datasets. Using AIS data from ferry boats as a proxy for extensive maritime drone operations, we address the challenges of large-scale data exploration over eight days of repetitive ferry movements across a busy strait, simulating conditions suitable for autonomous vessels. Our approach investigates movement patterns, operational stability during repeated trips, and potential collision scenarios. To support such analyses, we propose a general, reusable workflow and a set of practical guidelines for applying visual analytics techniques to large maritime movement datasets. The findings highlight the scalability and adaptability of visual analytics methods, providing valuable tools for analyzing complex maritime datasets and advancing autonomous vessel technologies.
Big Data Research, vol. 43, Article 100575 (Journal Article)
Citations: 0
MIND: A metadata-driven INgestion design pattern for efficient data ingestion
IF 4.2 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-27 | DOI: 10.1016/j.bdr.2025.100574
Chiara Rucco, Antonella Longo, Motaz Saad
Data ingestion plays a crucial role in enterprise data management, particularly when dealing with large-scale datasets. This paper introduces the Metadata-driven INgestion Design pattern (MIND), a flexible, metadata-driven design pattern for cloud-based big data management. MIND enhances adaptability by enabling dynamic adjustments to ingestion types, schema updates, table additions, and the integration of new data sources.
Validated on the Azure cloud platform, MIND demonstrates scalability, feasibility, and efficiency in reducing ingestion time and operational complexity. By relying on metadata for pipeline orchestration, MIND offers a robust solution to the challenges of high-volume data processing, providing a more agile and maintainable approach to data workflows. This work contributes to the evolution of metadata-driven architectures and offers a foundation for future advancements in data management technologies.
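To make the idea of metadata-driven orchestration concrete, here is a toy sketch in which a metadata table, rather than pipeline code, decides which sources are ingested and how. Every name in it (the metadata fields, the `full` and `incremental` modes, the watermark state) is hypothetical and is not drawn from the MIND paper or from any Azure service.

```python
def full_load(rows, table_state):
    """Reload every row from the source."""
    return list(rows)


def incremental_load(rows, table_state):
    """Load only rows whose id is above the stored watermark."""
    watermark = table_state.get("watermark", 0)
    return [row for row in rows if row["id"] > watermark]


# Ingestion types are looked up by name, so a new type is one new entry here
LOADERS = {"full": full_load, "incremental": incremental_load}


def ingest(metadata, sources, state):
    """Run one ingestion pass driven entirely by the metadata entries."""
    loaded = {}
    for entry in metadata:
        if not entry.get("enabled", True):
            continue  # a source is switched off by editing metadata only
        loader = LOADERS[entry["mode"]]
        table_state = state.get(entry["table"], {})
        loaded[entry["table"]] = loader(sources[entry["source"]], table_state)
    return loaded


metadata = [
    {"source": "crm", "table": "customers", "mode": "full"},
    {"source": "orders_api", "table": "orders", "mode": "incremental"},
    {"source": "legacy", "table": "archive", "mode": "full", "enabled": False},
]
sources = {
    "crm": [{"id": 1}, {"id": 2}],
    "orders_api": [{"id": 5}, {"id": 9}],
    "legacy": [{"id": 1}],
}
state = {"orders": {"watermark": 5}}

loaded = ingest(metadata, sources, state)
print(loaded)
```

Adding a table or changing its ingestion type then amounts to editing a metadata row, which is the kind of adaptability the abstract describes.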
Big Data Research, vol. 43, Article 100574 (Journal Article)
Citations: 0
Novel V2X-based traffic congestion prediction system
IF 4.2 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-25 | DOI: 10.1016/j.bdr.2025.100577
Norman Bereczki, Vilmos Simon
The substantial growth in the number of commercial vehicles has put great stress on road infrastructure. This has led to overcrowded transportation infrastructure and traffic congestion, which has become one of the most pressing problems in modern cities. The latest trends in telecommunication, sensorization and machine learning have enabled engineers to design Cooperative Intelligent Transportation Systems (C-ITS) that enhance road safety and traffic efficiency by gathering and processing data. A frequently studied C-ITS application is congestion prediction. In this paper, we introduce a novel V2X-based congestion forecasting system called VEEPS (V2X-Based Traffic Congestion Prediction System), which predicts traffic congestion relying solely on local vehicle measurements and on information exchange between vehicles and infrastructure. The system uses a new, locally operating metric called the following distance ratio (FDR) to predict the future state of the traffic on a road section based on spatio-temporal FDR features. It overcomes the primary limitations of most existing methods, namely the requirement for a central intelligent agent and the deployment of a myriad of traffic measurement sensors, which make those methods infeasible and uneconomical for extensive city networks. The experimental setup shows that VEEPS can outperform existing statistical, machine learning and deep learning-based systems in terms of accuracy, with a lower cost of deployment and maintenance.
Big Data Research, Volume 43, Article 100577
Citations: 0
Unleashing the power of digital twin and big data as a new frontier for smart mobility: An ecosystem perspective
IF 4.2 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-11-24 DOI: 10.1016/j.bdr.2025.100576
Francesca Loia , Claudia Perillo , Ginevra Gravili
Nowadays, a new concept of the smart city, driven by technological advancements, has emerged with a significant impact on various domains, including mobility. Among these technologies, the digital twin has recently gained attention; however, its impact on smart mobility, particularly in combination with big data, remains underexplored. This paper therefore investigates the role of digital twin technology in conjunction with big data in the context of smart mobility.
A case study approach has been adopted to analyze the Italian context. Results highlight the ecosystem elements and identify the primary drivers for sustainable growth in integrating digital twin and big data technologies within smart mobility. Two iterative loops have been identified. The first connects technology service providers with mobility stakeholders, illustrating how artificial intelligence-driven, user-centric mobility solutions are co-created through perceptive and responsive mechanisms. The second links mobility stakeholders with end-users, enhancing operational efficiency through user-generated knowledge and ultimately leading to improved mobility experiences and urban transportation systems.
This study contributes to the literature by providing a structured analysis of digital twin applications in smart mobility, emphasizing the interplay between big data and ecosystem dynamics. The findings offer theoretical and practical implications, highlighting opportunities for policymakers, technology providers, and mobility operators to foster sustainable and data-driven urban mobility solutions. Finally, directions for future research are discussed, outlining potential advancements in digital twin integration for smart mobility ecosystems.
Big Data Research, Volume 43, Article 100576
Citations: 0