Pub Date: 2025-01-08 | DOI: 10.1109/TBDATA.2025.3527208
Hongyi Lin;Yang Liu;Liang Wang;Xiaobo Qu
The rapid development of Big Data and artificial intelligence (AI) is revolutionizing the automotive and transportation industries, leading to the creation of the Autonomous Modular Bus (AMB). Designed to address the key challenges of modern public transportation systems, the AMB adopts a modular dynamic assembly approach. However, existing research on the AMB predominantly focuses on operational aspects, whereas in-transit docking remains the primary obstacle to its commercial deployment. This challenge stems from the fact that current perception accuracy in autonomous vehicles is limited to the decimeter level, with insufficient capability to manage adverse weather and complex traffic conditions. To enable AMBs to achieve full-scenario autonomous driving capabilities, this paper reviews current perception technologies from three perspectives: single-vehicle single-sensor perception, multi-sensor fusion perception, and cooperative perception. It examines the characteristics of existing perception solutions and evaluates their applicability to AMB-specific requirements. Furthermore, considering the unique challenges of in-transit docking, this paper identifies and proposes four future research directions for advancing AMB perception systems as well as general autonomous driving technologies.
"Big Data-Driven Advancements and Future Directions in Vehicle Perception Technologies: From Autonomous Driving to Modular Buses," IEEE Transactions on Big Data, vol. 11, no. 3, pp. 1568–1587.
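Multi-sensor fusion is one of the three perception routes the review covers; the basic reason it can push past the decimeter-level accuracy of a single sensor is statistical: fusing independent estimates lowers the combined variance. A minimal sketch, assuming Gaussian, independent measurements (the sensor values and variances below are invented for illustration):

```python
def fuse_measurements(measurements):
    """Fuse independent position estimates by inverse-variance weighting.

    Each measurement is a (value, variance) pair. The fused variance is
    never larger than the best individual one, which is why combining,
    say, lidar and camera estimates can beat any single sensor.
    """
    weights = [1.0 / var for _, var in measurements]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, measurements)) / total
    return value, 1.0 / total

# Illustrative numbers: a decimeter-level lidar fix fused with a noisier
# camera fix; the fused variance drops below the lidar's alone.
fused, var = fuse_measurements([(10.05, 0.01), (10.20, 0.09)])
```

The same weighting generalizes to per-axis position vectors; cooperative perception adds estimates from other vehicles or roadside units to the same sum.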
Pub Date: 2025-01-08 | DOI: 10.1109/TBDATA.2024.3524830
Ren Li;Huazhong Liu;Xiaotong Zhou;Jiawei Wang;Jihong Ding;Laurence T. Yang;Hua Li;Yunfan Zhang
With the explosion of social media platforms, a substantial amount of data is generated from social information networks. Tensor-based multi-modal clustering methods have been widely applied in various scenarios of social information networks by mining potential correlative relationships from large-scale heterogeneous data. Nevertheless, the accuracy and efficiency of tensor-based multi-modal clustering methods are seriously restricted by noisy data and the curse of dimensionality. Therefore, this paper presents a Tucker-based multi-modal clustering (TuMC) method and an improved TuMC (ITuMC) to enhance the accuracy and efficiency of multi-modal clustering. First, we propose two Tucker-based attribute weight ranking learning approaches to calculate the weight tensor efficiently. Then, we present a calculation approach for the Tucker-based selective weighted tensor distance (SWTD) and the TuMC method. Meanwhile, the ITuMC method is obtained by optimizing the calculation efficiency of the SWTD to further improve clustering speed. Finally, we present a Tucker-based multi-modal clustering and service framework for social information networks. Extensive experimental results on the Geolife GPS trajectory and electricity consumption datasets demonstrate that the TuMC and ITuMC methods cluster multi-source heterogeneous data with higher accuracy and efficiency in complex social information networks, as measured by DVI, AR, and execution time.
"Tucker-Based High-Accuracy Multi-Modal Clustering for Social Information Network," IEEE Transactions on Big Data, vol. 11, no. 4, pp. 1677–1691.
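The abstract does not spell out the SWTD formula, so the sketch below is only one plausible reading of a "selective weighted tensor distance": weight each attribute, then keep only the top-weighted attributes when computing the distance, so noisy attributes are down-weighted and negligible ones skipped. All names and numbers are illustrative:

```python
def selective_weighted_distance(x, y, weights, top_m):
    """Weighted Euclidean distance computed over only the top_m
    highest-weight attributes of feature vectors x and y.

    A hypothetical stand-in for the selective idea: attribute 2 below
    differs by 5.0 but has weight 0.0, so it never enters the distance.
    """
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    keep = ranked[:top_m]
    return sum(weights[i] * (x[i] - y[i]) ** 2 for i in keep) ** 0.5
```

Extending this from vectors to the mode-wise slices of a Tucker-decomposed tensor is where the paper's actual method would differ from this toy.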
Pub Date: 2025-01-08 | DOI: 10.1109/TBDATA.2025.3527238
Yao Liu;Yongfei Zhang
With the increasing amount of data in various domains, knowledge graphs (KGs) have become powerful tools for representing complex and heterogeneous information in a structured way and for extracting valuable information through embedding techniques to support downstream tasks such as recommendation and Q&A systems. Knowledge graphs consist of triples that are continuously added as knowledge is updated. However, most existing embedding models are designed for static graphs and require the entire model to be retrained for each update, which is time-consuming. Existing global dynamic embedding models focus on exploiting the structural and relational information of the whole graph to achieve embedding quality, at the cost of reduced dynamic efficiency. To address this problem, we propose a relational clustering-based parallel space model in which knowledge from different domains is embedded in different subspaces, allowing each subspace to focus on the data characteristics of a specific domain and thereby improving embedding quality. Moreover, new data affects only some subspaces, leaving the performance of the other spaces untouched and improving the model's adaptability to dynamic updates. Furthermore, we employ two incremental approaches, chosen according to the type of added data, to improve the efficiency of dynamic embedding while ensuring that the added data preserves the characteristics of the parallel space. The experimental results show that the dynamic embedding efficiency of our model improves by an average of 50.3% over the SOTA dynamic embedding model on the link prediction task. Particularly on FB15K, our model not only improves efficiency by 41% but also increases accuracy by 7.5%, demonstrating both the accuracy and the efficiency of our model.
"Relational Clustering-Based Parallel Spaces Construction and Embedding for Dynamic Knowledge Graph," IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2308–2320.
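The core routing idea, that each relation cluster owns a subspace so an update marks only that subspace stale, can be sketched as plain bookkeeping. The cluster assignment and the embedding step themselves are placeholders here, and the example relations are invented:

```python
from collections import defaultdict

class ParallelSpaceIndex:
    """Toy sketch of the parallel-space idea: each triple is routed to a
    subspace by its relation's cluster, so adding a triple marks only one
    subspace as needing re-embedding while all others stay valid."""

    def __init__(self, relation_to_cluster):
        self.relation_to_cluster = relation_to_cluster
        self.subspaces = defaultdict(list)   # cluster id -> list of triples
        self.stale = set()                   # subspaces awaiting re-embedding

    def add_triple(self, head, relation, tail):
        cluster = self.relation_to_cluster[relation]
        self.subspaces[cluster].append((head, relation, tail))
        self.stale.add(cluster)              # only this subspace is touched
        return cluster

# Hypothetical clustering: film relations share subspace 0, biography
# relations live in subspace 1.
index = ParallelSpaceIndex({"directed": 0, "acted_in": 0, "born_in": 1})
touched = index.add_triple("C. Nolan", "directed", "Inception")
```

The efficiency claim in the abstract rests on exactly this locality: re-embedding cost scales with the stale subspaces, not with the whole graph.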
Pub Date: 2025-01-08 | DOI: 10.1109/TBDATA.2025.3527215
Yichuan Cheng;Darrick Lee;Harald Oberhauser;Haoliang Li
The objective of domain generalization is to develop a model that can handle the domain shift problem without access to the target domain. In this paper, we propose a new domain generalization approach called Decomposition Framework with Dynamic Component Alignment (DFDCA), which employs signal decomposition on input data and conducts domain alignment on each component, providing another perspective on domain generalization for time series classification. Specifically, we first utilize a neural decomposition module to decompose the original time series data into several components, and design loss functions to guide the network to effectively perform signal decomposition for class-wise domain alignment on the decomposed components. The denoising attention mechanism is then introduced to enhance informative components while suppressing task-irrelevant components. Our proposed approach is evaluated on four publicly available datasets based on the cross-domain setting where the training and test samples are drawn from different distributions. The results demonstrate that it outperforms other baseline methods, achieving state-of-the-art performance.
"Generalized Time Series Classification via Component Decomposition and Alignment," IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2338–2352.
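The paper learns its decomposition with a neural module and dedicated loss functions; as a fixed-filter stand-in, a centered moving average already illustrates the decompose-then-align pattern of splitting a series into components that can be treated separately:

```python
def decompose(series, window):
    """Split a series into a smooth trend (centered moving average, with
    shrinking windows at the edges) and a residual component, such that
    series[i] == trend[i] + residual[i] for every i.

    A deliberately simple substitute for DFDCA's learned decomposition:
    the point is only that alignment can then act per component.
    """
    half = window // 2
    trend = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    residual = [s - t for s, t in zip(series, trend)]
    return trend, residual
```

In the DFDCA setting, class-wise domain alignment losses would then be applied to each returned component rather than to the raw series.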
Pub Date: 2025-01-01 | DOI: 10.1109/TBDATA.2024.3524831
Tun Li;Di Lei;Qian Li;Rong Wang;Chaolong Jia;Yunpeng Xiao
The development of social networks has prompted a shift in marketing strategies, with surging demand for marketing in vertical domains characterized by high user stickiness and specialization. To address this, we propose a traceability model based on domain preference and heterogeneous networks. First, to measure the vertical-domain features of marketing topics and capture how users' degree of preference for a domain influences topic propagation, the domains are treated as latent semantics, and the sparse user-topic association matrix is densified using a latent factor model to mine domain preference information efficiently. Second, considering the complexity of the associations among the multi-type elements of marketing topics, the HLN2vec (Heterogeneous Layer-wise Networks) model is proposed. This model uses heterogeneous network representation learning and incorporates multi-layer attention networks to learn representations that portray a marketing topic's key elements and their relationships. Finally, this paper proposes the DP-Rank (Domain Preference-based) algorithm, which uses domain preference features and an adaptive random walk strategy to quantify element influence. Experiments show that the proposed model applies robustly to social networks and exhibits clear advantages in measuring the vertical-domain features of marketing topics, constructing multi-type element relationship networks, and discovering the influence of core elements.
"A Marketing Topic Traceability Model Based on Domain Preference and Heterogeneous Network," IEEE Transactions on Big Data, vol. 11, no. 4, pp. 1692–1706.
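The densification step relies on a standard latent factor model: factor the sparse user-topic matrix into low-rank user and topic vectors fitted to the observed entries, then read predictions for the missing cells off the factors. A generic SGD sketch, where the rank, learning rate, and regularization are illustrative defaults rather than the paper's settings:

```python
import random

def factorize(entries, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=300):
    """Plain SGD matrix factorization over observed (user, item, value)
    entries of a sparse association matrix. Returns user factors P and
    item factors Q; the dot product P[u] . Q[i] predicts any cell,
    including the originally missing ones (the densification step)."""
    random.seed(0)  # deterministic toy init
    P = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in entries:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):  # simultaneous regularized update
                P[u][f], Q[i][f] = (P[u][f] + lr * (err * Q[i][f] - reg * P[u][f]),
                                    Q[i][f] + lr * (err * P[u][f] - reg * Q[i][f]))
    return P, Q
```

After fitting, every empty cell of the user-topic matrix gets a predicted preference score, which is what makes the downstream domain-preference features dense.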
Pub Date: 2025-01-01 | DOI: 10.1109/TBDATA.2024.3524832
Abdul Majeed;Seong Oun Hwang
In this paper, we propose and implement a novel anonymization model, called data-centric $\ell$-diversity, to effectively safeguard the privacy of individuals with considerably enhanced utility in data publishing scenarios. Through experimental analysis of real-life datasets, we found that when the data quality is poor (e.g., distributions are uneven), most of the existing methods only anonymize some parts of the data (where distributions are balanced) and leave other parts unprocessed, which can lead to explicit privacy disclosures. Furthermore, they do not identify and repair problematic parts of the data before anonymization, and therefore, they are not secure from the threat of privacy breaches. To address these technical problems, in this paper, we implement an automated method that identifies vulnerabilities in the underlying data to be anonymized with respect to its distribution and repairs them by injecting virtual samples of good quality. Later, we implement a data partitioning strategy that creates compact and diverse classes of size $k$, where $k$ is the privacy parameter. Finally, only shallow generalization (or no generalization) is applied to each class to minimally generalize the data, whereas existing methods overly distort data by not improving the quality beforehand, which can lead to poor utility in data-driven services. We conducted detailed experiments on four datasets to justify the performance of our model in realistic scenarios, and achieved promising results from the perspectives of boosted accuracy, privacy preservation, data utility enrichment, and reduced computing overheads. Compared with baseline methods, our model enhanced privacy preservation by 36.56% on three different metrics, and data utility was augmented with 18.65% less information loss and 14.37% greater accuracy. Lastly, our model, on average, has shown a 26.13% reduction in time overheads compared to the SOTA baseline methods.
"A Data-Centric ℓ-Diversity Model for Securely Publishing Personal Data With Enhanced Utility," IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2278–2295.
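For reference, the base property the data-centric model extends is (distinct) ℓ-diversity: every equivalence class of the anonymized table must contain at least ℓ distinct sensitive values, so that knowing someone's class does not pin down their sensitive attribute. A minimal checker, with toy sensitive values:

```python
def is_l_diverse(equivalence_classes, l):
    """Distinct l-diversity check: each equivalence class is given as the
    list of sensitive values of its records; the table passes when every
    class holds at least l distinct sensitive values."""
    return all(len(set(sensitive)) >= l for sensitive in equivalence_classes)

# Toy data: two equivalence classes with their sensitive attributes.
classes = [["flu", "hiv", "cold"], ["flu", "flu", "hiv"]]
```

The paper's contribution sits upstream of this check: repairing unevenly distributed data first, so that classes of size $k$ can pass it with only shallow generalization.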
Pub Date: 2025-01-01 | DOI: 10.1109/TBDATA.2024.3524829
Qingxi Peng;Zhenjie Weng;Wei Wang;Xinyi Wang;Lan You
Because the GitHub platform supports retrieving developers only by username, making it difficult to obtain developers' expertise information directly, this paper proposes an open source domain expert retrieval model (OSDERM) based on the network representation learning algorithm OSC2vec (Open Source Collaboration to Vector). The model consists of two core parts: Expert Profiling and Expert Finding. Expert Profiling enriches the expertise information in search results by labeling developers' expertise, while Expert Finding rapidly locates the most suitable domain experts through keyword matching, greatly saving the time and effort of searching for experts in the open source community. Experiments on a GitHub ecosystem dataset show that the model outperforms existing comparative algorithms in discovering open source domain experts and can provide an effective reference for enterprise recruitment.
"A Collaborative Network-Based Retrieval Model for Open Source Domain Experts," IEEE Transactions on Big Data, vol. 11, no. 4, pp. 1720–1732.
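The Expert Finding step reduces to matching query keywords against the expertise labels produced by Expert Profiling. A toy ranking sketch; the profile data and overlap-count scoring are invented for illustration, and OSDERM's actual matching over OSC2vec representations may differ:

```python
def find_experts(profiles, query_keywords, top_k=3):
    """Rank developers by how many of their expertise labels overlap the
    query keywords, breaking ties alphabetically, and drop zero matches.

    profiles: dict mapping developer name -> list of expertise labels.
    """
    scored = [(len(set(labels) & set(query_keywords)), name)
              for name, labels in profiles.items()]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [name for score, name in scored[:top_k] if score > 0]
```

A real system would replace the overlap count with similarity in the learned embedding space, but the retrieval interface (keywords in, ranked experts out) is the same.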
Pub Date: 2025-01-01 | DOI: 10.1109/TBDATA.2024.3524839
Biao Wang;Zhao Li;Zenghui Xu;Ji Zhang
Predicting the popularity of information in social networks is crucial for effective social marketing and recommendation systems. However, accurately comprehending the complex dynamics of information diffusion remains a challenging task. Existing methods, including feature-based approaches, point process models, and deep learning techniques, often fail to capture the fine-grained features of information cascades, such as dynamic diffusion patterns, cascade statistics, and the interplay between spatial and temporal information. To address these limitations, we propose Casformer, a novel graph-based Transformer architecture that effectively learns both micro-level time-aware structural information and macro-level long-term influence along the information propagation process. Casformer employs a cascade attention network (CAT) to capture the micro-level features and a Transformer model to learn the macro-level influence. Furthermore, we introduce an adaptive cascade graph sampling strategy based on the temporal diffusion pattern and cascade statistics of information to obtain the most informative cascade graph sequence. By leveraging multi-level fine-grained evolving features of information cascades, Casformer achieves high accuracy in information popularity prediction. Experimental results on real-world social network and scientific citation network datasets demonstrate the effectiveness and superiority of Casformer compared to state-of-the-art methods in information popularity prediction.
"Casformer: Information Popularity Prediction With Adaptive Cascade Sampling and Graph Transformer in Social Networks," IEEE Transactions on Big Data, vol. 11, no. 4, pp. 1652–1663.
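One simple way to read "adaptive cascade graph sampling based on the temporal diffusion pattern" is to cut snapshot boundaries by event count rather than by wall-clock time, so bursty phases of the cascade get finer temporal resolution. Casformer's strategy also uses cascade statistics; this sketch covers only the event-density part and uses made-up timestamps:

```python
def adaptive_snapshots(event_times, n_snapshots):
    """Group diffusion events (e.g., reshare timestamps) into snapshots of
    roughly equal event count instead of equal time span, so that dense
    bursts are split across several short snapshots."""
    events = sorted(event_times)
    per = -(-len(events) // n_snapshots)   # ceiling division
    return [events[i:i + per] for i in range(0, len(events), per)]
```

Each snapshot would then be turned into a cascade subgraph and fed to the attention network; the equal-count rule is what keeps early bursts from being compressed into a single coarse snapshot.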
Pub Date: 2025-01-01 | DOI: 10.1109/TBDATA.2024.3524833
Top-k Personalized PageRank (PPR) is a graph analysis method used to determine the $k$ most important nodes with respect to a source node. To realize fast Top-k PPR computation, indexing for each node is effective. When index-based Top-k PPR methods are applied to dynamic graphs, the index becomes stale with edge updates, and index correction is required. Although existing methods perform index correction for every update to guarantee Top-k PPR accuracy, they involve heavy re-indexing computation or significant memory overhead. This paper proposes a method that achieves comparable accuracy to guaranteed methods while significantly reducing re-indexing, by exploiting the fact that index references are concentrated on nodes whose index is unlikely to change due to edge updates. In particular, our method omits re-indexing as long as comparable accuracy is maintained. Furthermore, our method involves the minimum memory overhead among existing index-based methods: the space complexity of the index is $\Theta(n + m)$, where $n$ and $m$ are the number of nodes and edges of the graph, respectively. The evaluation results using real-world datasets show that our method achieves a Normalized Discounted Cumulative Gain above 0.999 until 20% of the edges have been updated since index generation.
{"title":"Reducing Re-Indexing for Top-k Personalized PageRank Computation on Dynamic Graphs","authors":"Tsuyoshi Yamashita;Naoki Matsumoto;Kunitake Kaneko","doi":"10.1109/TBDATA.2024.3524833","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3524833","url":null,"abstract":"Top-k Personalized PageRank (PPR) is a graph analysis method used to determine the <inline-formula><tex-math>$k$</tex-math></inline-formula> most important nodes with respect to a source node. To realize fast Top-k PPR computation, indexing for each node is effective. When we apply the index-based Top-k PPR methods to dynamic graphs, the index becomes stale with edge updates, and index correction is required. Although the existing methods perform index correction for every update to guarantee Top-k PPR accuracy, they involve heavy re-indexing computation or significant memory overhead. This paper proposes a method that achieves comparable accuracy to guaranteed methods while significantly reducing re-indexing by focusing on the fact that index references are concentrated on the nodes whose index is unlikely to change due to edge updates. In particular, our method omits re-indexing as long as we achieve comparable accuracy. Furthermore, our method involves the minimum memory overhead among the existing index-based methods. The space complexity of the index is <inline-formula><tex-math>$\Theta(n + m)$</tex-math></inline-formula>, where <inline-formula><tex-math>$n$</tex-math></inline-formula> and <inline-formula><tex-math>$m$</tex-math></inline-formula> are the number of nodes and edges of the graph, respectively. 
The evaluation results using real-world datasets show that our method achieves a Normalized Discounted Cumulative Gain above 0.999 until 20% of the edges have been updated since index generation.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 4","pages":"1707-1719"},"PeriodicalIF":7.5,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10819623","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144598067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-01-01DOI: 10.1109/TBDATA.2024.3524828
Khondhaker Al Momin;Arif Mohaimin Sadri;Kristin Olofsson;K.K. Muraleetharan;Hugh Gladwin
In an era increasingly affected by natural and human-caused disasters, the role of social media in disaster communication has become ever more critical. Despite substantial research on social media use during crises, a significant gap remains in detecting crisis-related misinformation. Detecting deviations in information is fundamental for identifying and curbing the spread of misinformation. This study introduces a novel Information Switching Pattern Model to identify dynamic shifts in perspectives among users who mention each other in crisis-related narratives on social media. These shifts serve as evidence of crisis misinformation affecting user-mention network interactions. The study utilizes advanced natural language processing, network science, and census data to analyze geotagged tweets related to compound disaster events in Oklahoma in 2022. The impact of misinformation is revealed by distinct engagement patterns among various user types, such as bots, private organizations, non-profits, government agencies, and news media throughout different disaster stages. These patterns show how different disasters influence public sentiment, highlight the heightened vulnerability of mobile home communities, and underscore the importance of education and transportation access in crisis response. Understanding these engagement patterns is crucial for detecting misinformation and leveraging social media as an effective tool for risk communication during disasters.
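The core idea of tracking perspective shifts in a user-mention network can be sketched as follows. This is a hypothetical simplification, not the paper's actual Information Switching Pattern Model: users' stances are assumed pre-labeled as +1/-1 per time step (e.g., by an upstream NLP classifier), and a "switch" is flagged when a user's stance flips right after being mentioned by a user who already held the new stance. All names and the encoding are illustrative assumptions.

```python
def detect_switches(stances, mentions):
    """Flag time steps where a user's stance flips after an
    opposite-stance mention.

    stances: dict user -> list of +1/-1 stance labels per time step
    mentions: dict user -> list of users who mention them
    Returns: dict user -> list of time steps where a switch occurred
    """
    switches = {}
    for user, series in stances.items():
        flips = []
        for t in range(1, len(series)):
            if series[t] != series[t - 1]:
                # Did any mentioning user already hold the new stance
                # at the previous time step?
                if any(stances[m][t - 1] == series[t]
                       for m in mentions.get(user, [])):
                    flips.append(t)
        switches[user] = flips
    return switches
```

Aggregating such switch events by user type (bot, news media, government agency, etc.) and disaster stage would then yield the kind of engagement patterns the study analyzes.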
{"title":"Information Switching Patterns of Risk Communication in Social Media During Disasters","authors":"Khondhaker Al Momin;Arif Mohaimin Sadri;Kristin Olofsson;K.K. Muraleetharan;Hugh Gladwin","doi":"10.1109/TBDATA.2024.3524828","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3524828","url":null,"abstract":"In an era increasingly affected by natural and human-caused disasters, the role of social media in disaster communication has become ever more critical. Despite substantial research on social media use during crises, a significant gap remains in detecting crisis-related misinformation. Detecting deviations in information is fundamental for identifying and curbing the spread of misinformation. This study introduces a novel <italic>Information Switching Pattern Model</italic> to identify dynamic shifts in perspectives among users who mention each other in crisis-related narratives on social media. These shifts serve as evidence of crisis misinformation affecting user-mention network interactions. The study utilizes advanced natural language processing, network science, and census data to analyze geotagged tweets related to compound disaster events in Oklahoma in 2022. The impact of misinformation is revealed by distinct engagement patterns among various user types, such as bots, private organizations, non-profits, government agencies, and news media throughout different disaster stages. These patterns show how different disasters influence public sentiment, highlight the heightened vulnerability of mobile home communities, and underscore the importance of education and transportation access in crisis response. 
Understanding these engagement patterns is crucial for detecting misinformation and leveraging social media as an effective tool for risk communication during disasters.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 4","pages":"1733-1744"},"PeriodicalIF":7.5,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10820023","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144606221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}