Large Language Models (LLMs) usually rely on extensive training datasets. In the financial domain, creating numerical reasoning datasets that mix tables and long text often incurs substantial manual annotation expense. To address the limited data resources and reduce the annotation cost, we introduce FinLLMs, a method for generating financial question-answering (QA) data from common financial formulas using LLMs. First, we compile a list of common financial formulas and construct a graph based on the variables these formulas employ. We then augment the formula set by combining formulas that share variables into new elements. Specifically, we start from the manually annotated formulas and merge those with shared variables by traversing the constructed graph. Finally, building on the collected formula set, we use LLMs to generate financial QA data that encompasses both tabular information and long textual content. Our experiments demonstrate that the synthetic data generated by FinLLMs effectively enhances the performance of various numerical reasoning models in the financial domain, including both pre-trained language models (PLMs) and fine-tuned LLMs, surpassing the performance obtained with two established benchmark financial QA datasets.
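The formula-graph construction described above can be sketched roughly as follows; the formula names and variables here are illustrative placeholders of ours, not taken from the paper. Formulas become nodes, an edge links two formulas that share a variable, and traversing edges yields merged formulas over the union of their variables.

```python
from itertools import combinations

# Illustrative formula set (names and variables are ours, not the paper's):
# each formula maps to the set of variables it uses.
formulas = {
    "gross_profit": {"revenue", "cogs"},
    "gross_margin": {"gross_profit", "revenue"},
    "net_margin": {"net_income", "revenue"},
}

# Graph construction: formulas are nodes; connect two formulas
# whenever they share at least one variable.
edges = {
    (a, b)
    for a, b in combinations(sorted(formulas), 2)
    if formulas[a] & formulas[b]
}

# Traversal/merge: each connected pair yields an augmented formula
# over the union of its parents' variables.
augmented = {f"{a}+{b}": formulas[a] | formulas[b] for a, b in edges}
```

Each merged entry can then seed an LLM prompt that asks for a table, a passage, and a question whose answer requires the combined formula.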
Title: FinLLMs: A Framework for Financial Reasoning Dataset Generation With Large Language Models
Authors: Ziqiang Yuan;Kaiyuan Wang;Shoutai Zhu;Ye Yuan;Jingya Zhou;Yanlin Zhu;Wenqi Wei
Pub Date : 2024-12-30 DOI: 10.1109/TBDATA.2024.3524083
IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2264-2277
Pub Date : 2024-12-30 DOI: 10.1109/TBDATA.2024.3524081
Jinsong Chen;Chang Liu;Kaiyuan Gao;Gaichao Li;Kun He
Graph Transformers, an emerging architecture for graph representation learning, suffer from quadratic complexity and can only handle graphs with at most thousands of nodes. To this end, we propose the Neighborhood Aggregation Graph Transformer (NAGphormer), which treats each node as a sequence of tokens constructed by our proposed Hop2Token module. For each node, Hop2Token aggregates the neighborhood features from different hops into separate representations, producing a sequence of token vectors as one input. In this way, NAGphormer can be trained in a mini-batch manner and thus scales to large graphs with millions of nodes. To further enhance the model's generalization, we propose NAGphormer+, an extension of NAGphormer with a novel data augmentation method called Neighborhood Augmentation (NrAug). Based on the output of Hop2Token, NrAug augments neighborhood features from both global and local views. In this way, NAGphormer+ can fully utilize the neighborhood information of multiple nodes, leading to more comprehensive training and improved generalization. Extensive experiments on benchmark datasets from small to large demonstrate the superiority of NAGphormer+ over existing graph Transformers and mainstream GNNs, as well as over the original NAGphormer.
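The Hop2Token idea can be sketched as below. This toy version uses a row-normalized propagation matrix with self-loops and a hand-made three-node graph; both are simplifying assumptions of ours, not the paper's exact formulation.

```python
import numpy as np

def hop2token(A, X, K):
    """For each node, build a (K+1)-token sequence where token k
    aggregates features from the node's k-hop neighborhood."""
    A_hat = A + np.eye(len(A))              # add self-loops
    d = A_hat.sum(axis=1)
    A_norm = A_hat / d[:, None]             # row-normalized propagation
    tokens = [X]                            # hop 0 = the node's own features
    for _ in range(K):
        tokens.append(A_norm @ tokens[-1])  # hop k built from hop k-1
    # Shape (num_nodes, K+1, feature_dim): one independent token sequence
    # per node, which is what makes mini-batch training possible.
    return np.stack(tokens, axis=1)

# Tiny path graph 0-1-2 with 2-dim features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
seq = hop2token(A, X, K=2)
```

A Transformer then attends over each node's (K+1)-token sequence in isolation, so batches of nodes can be sampled freely without loading the whole graph.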
Title: NAGphormer+: A Tokenized Graph Transformer With Neighborhood Augmentation for Node Classification in Large Graphs
IEEE Transactions on Big Data, vol. 11, no. 4, pp. 2085-2098
Multi-view multi-label classification is a crucial machine learning paradigm aimed at building robust multi-label predictors by integrating heterogeneous features from various sources while addressing multiple correlated labels. However, in real-world applications, concerns over data confidentiality and security often prevent data exchange or fusion across different sources, leading to the challenging issue of data islands. To tackle this problem, we propose a general federated multi-view multi-label classification method, FMVML, which integrates a novel multi-view multi-label classification technique into a federated learning framework. This approach enables cross-view feature fusion and multi-label semantic classification while preserving the data privacy of each independent source. Within this federated framework, we first extract view-specific information from each individual client to capture unique characteristics and then consolidate consensus information from different views on the global server to represent shared features. Unlike previous methods, our approach enhances cross-view fusion and semantic expression by jointly capturing both feature and semantic aspects of specificity and commonality. The final label predictions are generated by combining the view-specific predictions from individual clients and the consensus predictions from the global server. Extensive experiments across various applications demonstrate that FMVML fully leverages multi-view data in a privacy-preserving manner and consistently outperforms state-of-the-art methods.
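The final fusion step can be sketched as a simple convex combination of the averaged view-specific client predictions and the server's consensus prediction; the weighting rule and names below are an illustrative stand-in for the paper's actual fusion scheme.

```python
import numpy as np

def fuse_predictions(client_preds, consensus_pred, alpha=0.5):
    """Combine view-specific client predictions with the global server's
    consensus prediction. The convex combination here is illustrative,
    not the paper's exact rule."""
    specific = np.mean(client_preds, axis=0)  # average over clients/views
    return alpha * specific + (1 - alpha) * np.asarray(consensus_pred)

# Two clients' label-probability vectors and a server consensus.
client_preds = np.array([[0.8, 0.2],
                         [0.6, 0.4]])
consensus = np.array([0.5, 0.5])
fused = fuse_predictions(client_preds, consensus)
```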
Title: Federated Multi-View Multi-Label Classification
Authors: Hongdao Meng;Yongjian Deng;Qiyu Zhong;Yipeng Wang;Zhen Yang;Gengyu Lyu
Pub Date : 2024-12-26 DOI: 10.1109/TBDATA.2024.3522812
IEEE Transactions on Big Data, vol. 11, no. 4, pp. 2072-2084
Semantic expertise remains a reliable foundation for industrial decision-making, and Large Language Models (LLMs) can augment often-limited empirical knowledge by generating domain-specific insights, though the quality of this generated knowledge is uncertain. Integrating LLMs with the collective wisdom of multiple stakeholders could enhance the quality and scale of knowledge, yet such integration might inadvertently raise privacy concerns for stakeholders. In response to this challenge, Federated Learning (FL) is harnessed to improve knowledge-base quality by leveraging other stakeholders' knowledge in encrypted form, where each knowledge base is represented as a Knowledge Graph (KG). Initially, a multi-field hyperbolic (MFH) graph embedding method vectorizes entities, furnishing mathematical representations rather than solely semantic meanings. The FL framework then identifies and fuses common entities under encryption, whereby the updated entity embeddings can refine other private entity embeddings locally, thus enhancing overall KG quality. Finally, the KG complement method refines and clarifies triplets to improve the overall quality of the KG. Experiments assess the proposed approach across different industrial KGs, confirming its effectiveness as a viable solution for collaborative KG creation while maintaining data security.
Title: Unlocking Large Language Model Power in Industry: Privacy-Preserving Collaborative Creation of Knowledge Graph
Authors: Liqiao Xia;Junming Fan;Ajith Parlikad;Xiao Huang;Pai Zheng
Pub Date : 2024-12-26 DOI: 10.1109/TBDATA.2024.3522814
IEEE Transactions on Big Data, vol. 11, no. 4, pp. 2046-2060
Pub Date : 2024-12-26 DOI: 10.1109/TBDATA.2024.3522804
Jiajun Sun;Dianliang Wu
The promising applications of mobile crowdsensing (MCS) have attracted much research interest recently, especially in posted-pricing settings. However, existing works mainly focus on stationary MCS, whether in a stochastic or adversarial environment, where the distribution of each price (or arm) remains identical over time. In many realistic MCS applications, such as environment monitoring and recommendation systems, stationary bandits do not model posted-pricing sequential decision problems in which the reward distribution of each price (arm) and the cost distribution vary over time due to changes in light intensity and mobile devices' remaining energy. In this paper, we study a more general submodular crowdsensing setting to address non-stationary sequential pricing problems, and construct a monotonic submodular function that merges the marginal reward and the temporal-difference errors (TD-errors) of deep reinforcement learning (DRL). Moreover, we design a weighted budget-limited non-stationary pricing mechanism based on the deep deterministic policy gradient (DDPG) method for submodular MCS, considering both hard-drop and soft-drop weights. Our mechanism readily extends to non-submodular MCS and other MCS settings. Extensive simulations demonstrate that our mechanism outperforms existing benchmarks.
Title: Online Non-Stationary Pricing Incentives for Budget-Limited Crowdsensing
IEEE Transactions on Big Data, vol. 11, no. 4, pp. 2025-2035
Pub Date : 2024-12-26 DOI: 10.1109/TBDATA.2024.3522805
Liner Yang;Jiaxin Yuan;Cunliang Kong;Jingsi Yu;Ruining Chong;Zhenghao Liu;Erhong Yang
The task of complexity-controllable definition generation is to provide definitions of different readability levels for words in specific contexts. This task can help language learners eliminate reading barriers and facilitate language acquisition. However, the available training data for this task remains scarce due to the difficulty of obtaining reliable definition data and the high cost of data standardization. To tackle these challenges, we introduce a general solution from both data-driven and method-driven perspectives. We construct a large-scale standard Chinese dataset, COMPILING, which contains both difficult and simple definitions and can serve as a benchmark for future research. In addition, we propose SimpDefiner, a multitasking framework for unsupervised controllable definition generation. By designing a parameter-sharing scheme between two decoders, the framework can extract complexity information from a non-parallel corpus. Moreover, we propose the SimpDefiner-guided prompting (SGP) method, in which simple definitions generated by SimpDefiner are used to construct prompts for GPT-4, yielding more realistic and contextually appropriate definitions. The results demonstrate SimpDefiner's strong controllable-generation ability, with further gains when GPT-4 is incorporated.
Title: Tailored Definitions With Easy Reach: Complexity-Controllable Definition Generation
IEEE Transactions on Big Data, vol. 11, no. 4, pp. 2061-2071
Pub Date : 2024-12-26 DOI: 10.1109/TBDATA.2024.3522817
Shicheng Cui;Deqiang Li;Jing Zhang
Graph Neural Networks (GNNs) have proven useful for learning graph-based knowledge. However, one drawback of GNN techniques is that they may get stuck in the over-squashing problem. Recent studies attribute this to the message-passing paradigm, which may amplify certain local relations and distort long-range information in a given GNN. To alleviate this phenomenon, we propose a novel and general GNN framework, dubbed MC-GNN, which introduces a multi-channel neural architecture to learn and fuse multi-view graph-based information. The purpose of MC-GNN is to extract distinct channel-based graph features and adaptively adjust their importance. To this end, we use the Hilbert-Schmidt Independence Criterion (HSIC) to enlarge the disparity between the embeddings encoded by each channel, and employ an attention mechanism to fuse the embeddings with adaptive weight adjustment. MC-GNN can be instantiated with multiple GNN backbones, providing a solution for learning structural relations from a multi-view perspective. Experimental results demonstrate that the proposed MC-GNN is superior to the compared state-of-the-art GNN methods.
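A minimal empirical HSIC, using the standard biased estimator trace(KHLH)/(n-1)^2 with RBF kernels; the bandwidth and the toy data are our choices, not the paper's. Larger values indicate stronger dependence between two embedding sets, so an MC-GNN-style objective would push this term down across channels to keep their embeddings dissimilar.

```python
import numpy as np

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC between two sets of embeddings (rows = samples)."""
    n = X.shape[0]
    def rbf(Z):
        # Pairwise squared distances -> Gaussian (RBF) kernel matrix.
        sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * sigma ** 2))
    K, L = rbf(X), rbf(Y)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 1))
dependent = hsic(X, X)                        # identical channels
independent = hsic(X, rng.normal(size=(256, 1)))
```

Identical inputs give a large HSIC, independent draws give a value near zero, and a constant input gives exactly zero because centering annihilates its kernel matrix.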
Title: MC-GNN: Multi-Channel Graph Neural Networks With Hilbert-Schmidt Independence Criterion
IEEE Transactions on Big Data, vol. 11, no. 4, pp. 2036-2045
Pub Date : 2024-12-25 DOI: 10.1109/TBDATA.2024.3522801
Otabek Sattarov;Jaeyoung Choi
With recent advancements in social network platforms, an overwhelming amount of information spreads rapidly, and it can be increasingly difficult to discern what is false or true. If false information proliferates widely, it can lead to undesirable outcomes. Hence, when we receive a piece of information, we can pose two questions: (i) Is the information true? (ii) If not, who initially spread it? The first is the rumor detection problem; the second is the rumor source detection problem. Rumor detection involves identifying and mitigating false or misleading information spread via various communication channels, particularly online platforms and social media. Rumors range from harmless ones to deliberately misleading content aimed at deceiving or manipulating audiences. Detecting misinformation is crucial for maintaining the integrity of information ecosystems and preventing harmful effects such as the spread of false beliefs, polarization, and even societal harm. It is therefore important to quickly identify such misinformation while simultaneously finding its source to block it from spreading through the network. However, most existing surveys have analyzed these two issues separately. In this work, we survey existing research on rumor detection and rumor source detection, together with joint detection approaches. Treating the two problems together makes their relationship observable and shows how they are similar and different. The limitations of rumor detection, rumor source detection, and their combination are also explained, and some challenges to be addressed in future work are presented.
Title: Detection of Rumors and Their Sources in Social Networks: A Comprehensive Survey
IEEE Transactions on Big Data, vol. 11, no. 3, pp. 1528-1547
Pub Date : 2024-12-25 DOI: 10.1109/TBDATA.2024.3522816
Delong Ma;Ye Yuan;Yanfeng Zhang;Chunze Cao;Yuliang Ma
Counting triangles is an important topic in many practical applications, such as anomaly detection, community search, and recommendation systems. For triangle counting in large and dynamic graphs, recent work has focused on distributed streaming algorithms. These works assume that the graph is processed in a single location, while in reality the graph stream may be generated and processed in geographically distributed datacenters. This raises new challenges for existing triangle counting algorithms, due to multi-level heterogeneity in network bandwidth and communication prices across geo-distributed datacenters. In this article, we propose a cost-aware framework named GeoTri, based on the Master-Worker-Aggregator architecture, which takes both cost and performance objectives into consideration for triangle counting in geo-distributed datacenters. The two core parts of this framework are the cost-aware node assignment strategy in the master, which is critical for determining node placement and distributing edges so as to reduce both time and monetary cost, and the cost-aware neighbor transfer strategy among workers, which further eliminates redundancy in data transfers. Additionally, we conduct extensive experiments on seven real-world graphs, and the results demonstrate that GeoTri significantly lowers both runtime and monetary cost while maintaining high accuracy and scalability.
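For intuition, the sequential baseline that such distributed frameworks parallelize can be sketched as a common-neighbor triangle count; the edge list below is illustrative.

```python
def count_triangles(edges):
    """Exact sequential triangle count. For each edge (u, v), the triangles
    through it are the common neighbors of u and v; each triangle is
    counted once per edge, hence the division by 3."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

# Illustrative edge list containing triangles {0, 1, 2} and {1, 2, 3}.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (1, 3)]
```

In a geo-distributed setting, the expensive part is not this computation but deciding which datacenter holds which nodes and which neighbor sets must cross links, since bandwidth and prices differ per link.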
{"title":"Cost-Aware Triangle Counting Over Geo-Distributed Datacenters","authors":"Delong Ma;Ye Yuan;Yanfeng Zhang;Chunze Cao;Yuliang Ma","doi":"10.1109/TBDATA.2024.3522816","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3522816","url":null,"abstract":"Counting triangles is an important topic in many practical applications, such as anomaly detection, community search, and recommendation systems. For triangle counting in large and dynamic graphs, recent work has focused on distributed streaming algorithms. These works assume that the graph is processed in a single location, while in reality the graph stream may be generated and processed in datacenters that are geographically distributed. This raises new challenges for existing triangle counting algorithms, due to the multi-level heterogeneity in network bandwidth and communication prices across geo-distributed datacenters. In this article, we propose a cost-aware framework named <inline-formula><tex-math>${\sf GeoTri}$</tex-math></inline-formula> based on the Master-Worker-Aggregator architecture, which takes both cost and performance objectives into consideration for triangle counting in geo-distributed datacenters. The two core parts of this framework are the cost-aware node assignment strategy in the master, which is critical for obtaining node positions and distributing edges reasonably to reduce cost (i.e., time cost and monetary cost), and the cost-aware neighbor transfer strategy among workers, which further eliminates redundancy in data transfers. 
Additionally, we conduct extensive experiments on seven real-world graphs, and the results demonstrate that <inline-formula><tex-math>${\sf GeoTri}$</tex-math></inline-formula> significantly lowers both runtime and monetary cost while exhibiting good accuracy and scalability.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 4","pages":"2008-2024"},"PeriodicalIF":7.5,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144597619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
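The abstract above deals with counting triangles over a graph that arrives as an edge stream distributed across workers. The paper ships no code here, so as a minimal single-machine sketch of the core incremental step only (not the GeoTri framework itself; all names are illustrative assumptions): each new edge (u, v) closes one triangle per common neighbor of u and v.

```python
from collections import defaultdict

def stream_triangle_count(edge_stream):
    """Exact incremental triangle counting over an edge stream.

    On arrival of edge (u, v), every neighbor shared by u and v
    closes exactly one new triangle.
    """
    neighbors = defaultdict(set)
    triangles = 0
    for u, v in edge_stream:
        if u == v or v in neighbors[u]:
            continue  # skip self-loops and duplicate edges
        triangles += len(neighbors[u] & neighbors[v])
        neighbors[u].add(v)
        neighbors[v].add(u)
    return triangles

# A 4-clique contains exactly 4 triangles.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(stream_triangle_count(edges))  # → 4
```

In the geo-distributed setting described by the paper, the interesting part is precisely what this sketch omits: deciding which datacenter holds each node's neighbor set so that the intersection step minimizes cross-datacenter traffic and its monetary cost.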
Pub Date: 2024-12-13 DOI: 10.1109/TBDATA.2024.3517313
Xinzhi Wang;Hang Yu;Jiayu Guo;Pengbo Li;Xiangfeng Luo
The massive volume of data in the modern business world requires fraud detection to be automated. Hence, researchers have modeled fraud scenarios as graph data and proposed graph-based fraud detection methods. These methods treat fraud detection as a binary node classification task; however, they ignore the differences among nodes of the same class. In this paper, we distinguish behavioral differences among nodes of the same class to improve the model’s ability to detect deviation, i.e., we perform a fine-grained classification of user behavior (into what we call prototypes) and propose an adaptive prototype-based graph neural network (APGNN) for fraud detection. APGNN learns node behavior representations by extracting both neighborhood and global information, supplying preliminary knowledge for the adaptive creation of several prototypes, each representing a distinct behavior pattern. Subsequently, a new loss function is employed to enhance the prototypes’ capacity to capture these behavior patterns and to amplify the feature differences between different prototypes. Nodes are then projected onto these prototypes to derive the final behavior patterns. Extensive experiments on four real-world datasets show that this method provides better fraud detection as well as more understandable results.
{"title":"Towards Fraud Detection via Fine-Grained Classification of User Behavior","authors":"Xinzhi Wang;Hang Yu;Jiayu Guo;Pengbo Li;Xiangfeng Luo","doi":"10.1109/TBDATA.2024.3517313","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3517313","url":null,"abstract":"The massive volume of data in the modern business world requires fraud detection to be automated. Hence, researchers have modeled fraud scenarios as graph data and proposed graph-based fraud detection methods. These methods treat fraud detection as a binary node classification task; however, they ignore the differences among nodes of the same class. In this paper, we distinguish behavioral differences among nodes of the same class to improve the model’s ability to detect deviation, i.e., we perform a fine-grained classification of user behavior (into what we call prototypes) and propose an adaptive prototype-based graph neural network (APGNN) for fraud detection. APGNN learns node behavior representations by extracting both neighborhood and global information, supplying preliminary knowledge for the adaptive creation of several prototypes, each representing a distinct behavior pattern. Subsequently, a new loss function is employed to enhance the prototypes’ capacity to capture these behavior patterns and to amplify the feature differences between different prototypes. Nodes are then projected onto these prototypes to derive the final behavior patterns. 
Extensive experiments on four real-world datasets show that this method provides better fraud detection as well as more understandable results.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 4","pages":"1994-2007"},"PeriodicalIF":7.5,"publicationDate":"2024-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144597738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
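APGNN's final step, as described above, projects each node's learned representation onto the nearest of several prototype vectors to assign it a behavior pattern. A minimal NumPy sketch of that nearest-prototype projection (a generic illustration under assumed Euclidean distance; the actual prototype learning, loss, and distance choice in APGNN are not reproduced here):

```python
import numpy as np

def assign_prototypes(embeddings, prototypes):
    """Assign each node embedding to its nearest prototype.

    embeddings: (n, d) array of node behavior representations
    prototypes: (k, d) array of learned behavior-pattern centers
    Returns an (n,) array of prototype indices.
    """
    # Squared Euclidean distance between every node and every prototype,
    # computed via broadcasting: result shape is (n, k).
    dists = ((embeddings[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Two assumed prototypes (e.g., "normal" vs. "suspicious" behavior patterns).
protos = np.array([[0.0, 0.0], [10.0, 10.0]])
nodes = np.array([[0.5, -0.2], [9.0, 11.0], [0.1, 0.3]])
print(assign_prototypes(nodes, protos))  # → [0 1 0]
```

The point of the fine-grained scheme is that fraud and non-fraud each split into several such prototypes, so two nodes with the same binary label can still land on different behavior patterns.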