IEEE Transactions on Big Data最新文献_第2页

Efficiently Transfer User Profile Across Networks 有效地跨网络传输用户配置文件

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-06-13 DOI: 10.1109/TBDATA.2024.3414321

Mengting Diao;Zhongbao Zhang;Sen Su;Shuai Gao;Huafeng Cao;Junda Ye

User profiling has very important applications for many downstream tasks. Most existing methods only focus on modeling user profiles of one social network with plenty of data. However, user profiles are difficult to acquire, especially when the data is scarce. Fortunately, we observed that similar users have similar behavior patterns in different social networks. Motivated by such observations, in this paper, we for the first time propose to study the user profiling problem from the transfer learning perspective. We design two efficient frameworks for User Profile transferring acrOss Networks, i.e., UPON and E-UPON. In UPON, we first design a novel graph convolutional networks based characteristic-aware domain attention model to find user dependencies within and between domains (i.e., social networks). We then design a dual-domain weighted adversarial learning method to address the domain shift problem existing in the transferring procedure. In E-UPON, we optimize UPON in terms of computational complexity and memory. Specifically, we design a mini-cluster gradient descent based graph representation algorithm to shrink the searching space and ensure parallel computation. Then we use an adaptive cluster matching method to adjust the clusters of users. Experimental results on Twitter-Foursquare dataset demonstrate that UPON and E-UPON outperform the state-of-the-art models.

用户分析对于许多下游任务具有非常重要的应用。现有的方法大多只关注一个社交网络的大量数据的用户画像建模。然而，用户档案很难获得，特别是在数据稀缺的情况下。幸运的是，我们发现相似的用户在不同的社交网络中有着相似的行为模式。基于这些观察结果，本文首次提出从迁移学习的角度研究用户分析问题。我们设计了两种高效的跨网络用户配置文件传输框架，即UPON和E-UPON。在UPON中，我们首先设计了一种基于特征感知域注意力模型的新颖的图卷积网络，以发现域内和域之间的用户依赖关系（即社交网络）。然后设计了一种双域加权对抗学习方法来解决迁移过程中存在的域漂移问题。在E-UPON中，我们在计算复杂度和内存方面对UPON进行了优化。具体来说，我们设计了一种基于小聚类梯度下降的图表示算法，以缩小搜索空间并保证并行计算。然后采用自适应聚类匹配方法对用户聚类进行调整。在Twitter-Foursquare数据集上的实验结果表明，UPON和E-UPON优于最先进的模型。

{"title":"Efficiently Transfer User Profile Across Networks","authors":"Mengting Diao;Zhongbao Zhang;Sen Su;Shuai Gao;Huafeng Cao;Junda Ye","doi":"10.1109/TBDATA.2024.3414321","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3414321","url":null,"abstract":"User profiling has very important applications for many downstream tasks. Most existing methods only focus on modeling user profiles of one social network with plenty of data. However, user profiles are difficult to acquire, especially when the data is scarce. Fortunately, we observed that similar users have similar behavior patterns in different social networks. Motivated by such observations, in this paper, we for the first time propose to study the user profiling problem from the transfer learning perspective. We design two efficient frameworks for User Profile transferring acrOss Networks, i.e., UPON and E-UPON. In UPON, we first design a novel graph convolutional networks based characteristic-aware domain attention model to find user dependencies within and between domains (i.e., social networks). We then design a dual-domain weighted adversarial learning method to address the domain shift problem existing in the transferring procedure. In E-UPON, we optimize UPON in terms of computational complexity and memory. Specifically, we design a mini-cluster gradient descent based graph representation algorithm to shrink the searching space and ensure parallel computation. Then we use an adaptive cluster matching method to adjust the clusters of users. Experimental results on Twitter-Foursquare dataset demonstrate that UPON and E-UPON outperform the state-of-the-art models.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"271-285"},"PeriodicalIF":7.5,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Automatic Recognition of Cyberbullying in the Web of Things and social media using Deep Learning Framework 使用深度学习框架自动识别物联网和社交媒体中的网络欺凌

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-06-05 DOI: 10.1109/TBDATA.2024.3409939

Fahd N. Al-Wesabi;Marwa Obayya;Jamal Alsamri;Rana Alabdan;Nojood O Aljehane;Sana Alazwari;Fahad F. Alruwaili;Manar Ahmed Hamza;A Swathi

The Web of Things (WoT) is a network that facilitates the formation and distribution of information its users make. Young people nowadays, digital natives, have no trouble relating to others or joining groups online since they have grown up in a world where new technology has pushed communications to a nearly real-time level. Shared private messages, rumours, and sexual comments are all examples of online harassment that have led to several recent cases worldwide. Therefore, academics have been more interested in finding ways to recognise bullying conduct on these platforms. The effects of cyberbullying, a terrible form of online misbehaviour, are distressing. It takes several documents, but the text is predominant on social networks. Intelligent systems are required for the automatic detection of such occurrences. Most previous research has used standard machine-learning techniques to tackle this issue. The increasing pervasiveness of cyberbullying in WoT and other social media platforms is a significant cause for worry that calls for robust responses to prevent further harm. This study offers a unique method of leveraging the deep learning (DL) model binary coyote optimization-based Convolutional Neural Network (BCNN) in social networks to identify and classify cyberbullying. An essential part of this method is the combination of DL-based abuse detection and feature subset selection. To efficiently detect and address cases of cyberbullying via social media, the proposed system incorporates many crucial steps, including preprocessing, feature selection, and classification. A binary coyote optimization (BCO)-based feature subset selection method is presented to enhance classification efficiency. To improve the accuracy of cyberbullying categorization, the BCO algorithm efficiently chooses a selection of key characteristics. Cyberbullying must be tracked and classified across all internet channels, and Convolutional Neural Network (CNN) is constructed. With a best-case accuracy of 99.5% on Formspring, 99.7% on Twitter, and 99.3% on Wikipedia, the suggested algorithm successfully identified the vast majority of cyberbullying content.

物联网（WoT）是一个促进信息形成和分发的网络。如今的年轻人是数字原住民，他们与他人建立联系或加入网络团体没有任何困难，因为他们成长在一个新技术将通信推向几乎实时水平的世界。共享私人信息、谣言和性评论都是网络骚扰的例子，这些都导致了最近在世界范围内发生的几起案件。因此，学者们更感兴趣的是如何识别这些平台上的欺凌行为。网络欺凌是一种可怕的网络不当行为，其影响令人痛心。它需要几个文档，但文本在社交网络上占主导地位。需要智能系统来自动检测此类事件。之前的大多数研究都使用标准的机器学习技术来解决这个问题。WoT和其他社交媒体平台上越来越普遍的网络欺凌是一个令人担忧的重要原因，需要强有力的回应来防止进一步的伤害。本研究提供了一种独特的方法，利用深度学习（DL）模型基于二进制土狼优化的卷积神经网络（BCNN）在社交网络中识别和分类网络欺凌。该方法的关键部分是将基于dl的滥用检测与特征子集选择相结合。为了有效地检测和处理通过社交媒体的网络欺凌案件，所提出的系统包含了许多关键步骤，包括预处理、特征选择和分类。为了提高分类效率，提出了一种基于二元土狼优化（BCO）的特征子集选择方法。为了提高网络欺凌分类的准确性，BCO算法有效地选择了一组关键特征。网络欺凌必须跨越所有网络渠道进行跟踪和分类，为此构建了卷积神经网络（CNN）。该算法在Formspring上的最佳准确率为99.5%，在Twitter上为99.7%，在维基百科上为99.3%，成功识别了绝大多数网络欺凌内容。

{"title":"Automatic Recognition of Cyberbullying in the Web of Things and social media using Deep Learning Framework","authors":"Fahd N. Al-Wesabi;Marwa Obayya;Jamal Alsamri;Rana Alabdan;Nojood O Aljehane;Sana Alazwari;Fahad F. Alruwaili;Manar Ahmed Hamza;A Swathi","doi":"10.1109/TBDATA.2024.3409939","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3409939","url":null,"abstract":"The Web of Things (WoT) is a network that facilitates the formation and distribution of information its users make. Young people nowadays, digital natives, have no trouble relating to others or joining groups online since they have grown up in a world where new technology has pushed communications to a nearly real-time level. Shared private messages, rumours, and sexual comments are all examples of online harassment that have led to several recent cases worldwide. Therefore, academics have been more interested in finding ways to recognise bullying conduct on these platforms. The effects of cyberbullying, a terrible form of online misbehaviour, are distressing. It takes several documents, but the text is predominant on social networks. Intelligent systems are required for the automatic detection of such occurrences. Most previous research has used standard machine-learning techniques to tackle this issue. The increasing pervasiveness of cyberbullying in WoT and other social media platforms is a significant cause for worry that calls for robust responses to prevent further harm. This study offers a unique method of leveraging the deep learning (DL) model binary coyote optimization-based Convolutional Neural Network (BCNN) in social networks to identify and classify cyberbullying. An essential part of this method is the combination of DL-based abuse detection and feature subset selection. To efficiently detect and address cases of cyberbullying via social media, the proposed system incorporates many crucial steps, including preprocessing, feature selection, and classification. A binary coyote optimization (BCO)-based feature subset selection method is presented to enhance classification efficiency. To improve the accuracy of cyberbullying categorization, the BCO algorithm efficiently chooses a selection of key characteristics. Cyberbullying must be tracked and classified across all internet channels, and Convolutional Neural Network (CNN) is constructed. With a best-case accuracy of 99.5% on Formspring, 99.7% on Twitter, and 99.3% on Wikipedia, the suggested algorithm successfully identified the vast majority of cyberbullying content.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"259-270"},"PeriodicalIF":7.5,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Blockchain-Enabled Secure Collaborative Model Learning Using Differential Privacy for IoT-Based Big Data Analytics 基于物联网的大数据分析中使用差分隐私的区块链支持安全协作模型学习

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-04-29 DOI: 10.1109/TBDATA.2024.3394700

Prakash Tekchandani;Abhishek Bisht;Ashok Kumar Das;Neeraj Kumar;Marimuthu Karuppiah;Pandi Vijayakumar;Youngho Park

With the rise of Big data generated by Internet of Things (IoT) smart devices, there is an increasing need to leverage its potential while protecting privacy and maintaining confidentiality. Privacy and confidentiality in Big Data aims to enable data analysis and machine learning on large-scale datasets without compromising the dataset sensitive information. Usually current Big Data analytics models either efficiently achieves privacy or confidentiality. In this article, we aim to design a novel blockchain-enabled secured collaborative machine learning approach that provides privacy and confidentially on large scale datasets generated by IoT devices. Blockchain is used as secured platform to store and access data as well as to provide immutability and traceability. We also propose an efficient approach to obtain robust machine learning model through use of cryptographic techniques and differential privacy in which the data among involved parties is shared in a secured way while maintaining privacy and confidentiality of the data. The experimental evaluation along with security and performance analysis show that the proposed approach provides accuracy and scalability without compromising the privacy and security.

随着物联网（IoT）智能设备产生的大数据的兴起，在保护隐私和维护机密性的同时，越来越需要利用其潜力。大数据中的隐私和机密性旨在实现大规模数据集的数据分析和机器学习，而不会损害数据集的敏感信息。通常，当前的大数据分析模型要么有效地实现隐私，要么实现机密性。在本文中，我们的目标是设计一种新颖的支持区块链的安全协作机器学习方法，为物联网设备生成的大规模数据集提供隐私和机密性。区块链被用作存储和访问数据以及提供不变性和可追溯性的安全平台。我们还提出了一种有效的方法，通过使用加密技术和差分隐私来获得健壮的机器学习模型，其中相关方之间的数据以安全的方式共享，同时保持数据的隐私性和机密性。实验评估以及安全性和性能分析表明，该方法在不损害隐私和安全性的情况下具有准确性和可扩展性。

{"title":"Blockchain-Enabled Secure Collaborative Model Learning Using Differential Privacy for IoT-Based Big Data Analytics","authors":"Prakash Tekchandani;Abhishek Bisht;Ashok Kumar Das;Neeraj Kumar;Marimuthu Karuppiah;Pandi Vijayakumar;Youngho Park","doi":"10.1109/TBDATA.2024.3394700","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3394700","url":null,"abstract":"With the rise of Big data generated by Internet of Things (IoT) smart devices, there is an increasing need to leverage its potential while protecting privacy and maintaining confidentiality. Privacy and confidentiality in Big Data aims to enable data analysis and machine learning on large-scale datasets without compromising the dataset sensitive information. Usually current Big Data analytics models either efficiently achieves privacy or confidentiality. In this article, we aim to design a novel blockchain-enabled secured collaborative machine learning approach that provides privacy and confidentially on large scale datasets generated by IoT devices. Blockchain is used as secured platform to store and access data as well as to provide immutability and traceability. We also propose an efficient approach to obtain robust machine learning model through use of cryptographic techniques and differential privacy in which the data among involved parties is shared in a secured way while maintaining privacy and confidentiality of the data. The experimental evaluation along with security and performance analysis show that the proposed approach provides accuracy and scalability without compromising the privacy and security.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"141-156"},"PeriodicalIF":7.5,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Communication-Efficient Distributed Learning via Sparse and Adaptive Stochastic Gradient 基于稀疏和自适应随机梯度的高效通信分布式学习

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-03-30 DOI: 10.1109/TBDATA.2024.3407510

Xiaoge Deng;Dongsheng Li;Tao Sun;Xicheng Lu

Gradient-based optimization methods implemented on distributed computing architectures are increasingly used to tackle large-scale machine learning applications. A key bottleneck in such distributed systems is the high communication overhead for exchanging information, such as stochastic gradients, between workers. The inherent causes of this bottleneck are the frequent communication rounds and the full model gradient transmission in every round. In this study, we present SASG, a communication-efficient distributed algorithm that enjoys the advantages of sparse communication and adaptive aggregated stochastic gradients. By dynamically determining the workers who need to communicate through an adaptive aggregation rule and sparsifying the transmitted information, the SASG algorithm reduces both the overhead of communication rounds and the number of communication bits in the distributed system. For the theoretical analysis, we introduce an important auxiliary variable and define a new Lyapunov function to prove that the communication-efficient algorithm is convergent. The convergence result is identical to the sublinear rate of stochastic gradient descent, and our result also reveals that SASG scales well with the number of distributed workers. Finally, experiments on training deep neural networks demonstrate that the proposed algorithm can significantly reduce communication overhead compared to previous methods.

在分布式计算架构上实现的基于梯度的优化方法越来越多地用于处理大规模机器学习应用。这种分布式系统的一个关键瓶颈是工作人员之间交换信息（如随机梯度）的高通信开销。造成这一瓶颈的内在原因是频繁的通信轮次和每轮的全模型梯度传输。在本研究中，我们提出了一种通信高效的分布式算法SASG，它具有稀疏通信和自适应聚合随机梯度的优点。SASG算法通过自适应聚合规则动态确定需要通信的工作人员，并对传输的信息进行稀疏化，从而减少了分布式系统中通信轮数的开销和通信位的数量。在理论分析方面，我们引入了一个重要的辅助变量，并定义了一个新的Lyapunov函数来证明该算法是收敛的。收敛结果与随机梯度下降的次线性速率一致，并且我们的结果也表明，SASG随分布工作人员的数量有很好的扩展。最后，在训练深度神经网络上的实验表明，与之前的方法相比，该算法可以显著降低通信开销。

{"title":"Communication-Efficient Distributed Learning via Sparse and Adaptive Stochastic Gradient","authors":"Xiaoge Deng;Dongsheng Li;Tao Sun;Xicheng Lu","doi":"10.1109/TBDATA.2024.3407510","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3407510","url":null,"abstract":"Gradient-based optimization methods implemented on distributed computing architectures are increasingly used to tackle large-scale machine learning applications. A key bottleneck in such distributed systems is the high communication overhead for exchanging information, such as stochastic gradients, between workers. The inherent causes of this bottleneck are the frequent communication rounds and the full model gradient transmission in every round. In this study, we present SASG, a communication-efficient distributed algorithm that enjoys the advantages of sparse communication and adaptive aggregated stochastic gradients. By dynamically determining the workers who need to communicate through an adaptive aggregation rule and sparsifying the transmitted information, the SASG algorithm reduces both the overhead of communication rounds and the number of communication bits in the distributed system. For the theoretical analysis, we introduce an important auxiliary variable and define a new Lyapunov function to prove that the communication-efficient algorithm is convergent. The convergence result is identical to the sublinear rate of stochastic gradient descent, and our result also reveals that SASG scales well with the number of distributed workers. Finally, experiments on training deep neural networks demonstrate that the proposed algorithm can significantly reduce communication overhead compared to previous methods.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"234-246"},"PeriodicalIF":7.5,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification CompanyKG：公司相似性量化的大规模异质图

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-03-30 DOI: 10.1109/TBDATA.2024.3407573

Lele Cao;Vilhelm von Ehrenheim;Mark Granroth-Wilding;Richard Anselmo Stahl;Andrew McCornack;Armin Catovic;Dhiana Deva Cavalcanti Rocha

In the investment industry, it is often essential to carry out fine-grained company similarity quantification for a range of purposes, including market mapping, competitor analysis, and mergers and acquisitions. We propose and publish a knowledge graph, named CompanyKG, to represent and learn diverse company features and relations. Specifically, 1.17 million companies are represented as nodes enriched with company description embeddings; and 15 different inter-company relations result in 51.06 million weighted edges. To enable a comprehensive assessment of methods for company similarity quantification, we have devised and compiled three evaluation tasks with annotated test sets: similarity prediction, competitor retrieval and similarity ranking. We present extensive benchmarking results for 11 reproducible predictive methods categorized into three groups: node-only, edge-only, and node+edge. To the best of our knowledge, CompanyKG is the first large-scale heterogeneous graph dataset originating from a real-world investment platform, tailored for quantifying inter-company similarity.

在投资行业中，为了一系列目的（包括市场映射、竞争对手分析以及合并和收购），执行细粒度的公司相似性量化通常是必要的。我们提出并发布了一个名为CompanyKG的知识图谱，来表达和了解公司的各种特征和关系。具体来说，117万家公司被表示为富含公司描述嵌入的节点；15种不同的公司间关系产生了5106万条加权边。为了全面评估公司相似度量化方法，我们设计并编制了三个带有注释测试集的评估任务：相似度预测、竞争对手检索和相似度排名。我们为11种可重复的预测方法提供了广泛的基准测试结果，这些方法分为三组：仅节点、仅边缘和节点+边缘。据我们所知，CompanyKG是第一个源自真实世界投资平台的大规模异构图形数据集，专为量化公司间相似性而定制。

{"title":"CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification","authors":"Lele Cao;Vilhelm von Ehrenheim;Mark Granroth-Wilding;Richard Anselmo Stahl;Andrew McCornack;Armin Catovic;Dhiana Deva Cavalcanti Rocha","doi":"10.1109/TBDATA.2024.3407573","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3407573","url":null,"abstract":"In the investment industry, it is often essential to carry out fine-grained company similarity quantification for a range of purposes, including market mapping, competitor analysis, and mergers and acquisitions. We propose and publish a knowledge graph, named CompanyKG, to represent and learn diverse company features and relations. Specifically, 1.17 million companies are represented as nodes enriched with company description embeddings; and 15 different inter-company relations result in 51.06 million weighted edges. To enable a comprehensive assessment of methods for company similarity quantification, we have devised and compiled three evaluation tasks with annotated test sets: similarity prediction, competitor retrieval and similarity ranking. We present extensive benchmarking results for 11 reproducible predictive methods categorized into three groups: node-only, edge-only, and node+edge. To the best of our knowledge, CompanyKG is the first large-scale heterogeneous graph dataset originating from a real-world investment platform, tailored for quantifying inter-company similarity.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"247-258"},"PeriodicalIF":7.5,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

3A Multi-Classification Division-Aggregation Framework for Fake News Detection 虚假新闻检测的多分类划分聚合框架

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-03-26 DOI: 10.1109/TBDATA.2024.3378098

Wen Zhang;Haitao Fu;Huan Wang;Zhiguo Gong;Pan Zhou;Di Wang

Nowadays, as human activities are shifting to social media, fake news detection has been a crucial problem. Existing methods ignore the classification difference in online news and cannot take full advantage of multi-classification knowledges. For example, when coping with a post “A mouse is frightened by a cat,” a model that learns “computer” knowledge tends to misunderstand “mouse” and give a fake label, but a model that learns “animal” knowledge tends to give a true label. Therefore, this research proposes a multi-classification division-aggregation framework to detect fake news, named

$CKA$

, which innovatively learns classification knowledges during training stages and aggregates them during prediction stages. It consists of three main components: a news characterizer, an ensemble coordinator, and a truth predictor. The news characterizer is responsible for extracting news features and obtaining news classifications. Cooperating with the news characterizer, the ensemble coordinator generates classification-specifical models for the maximum reservation of classification knowledges during the training stage, where each classification-specifical model maximizes the detection performance of fake news on corresponding news classifications. Further, to aggregate the classification knowledges during the prediction stage, the truth predictor uses the truth discovery technology to aggregate the predictions from different classification-specifical models based on reliability evaluation of classification-specifical models. Extensive experiments prove that our proposed

$CKA$

outperforms state-of-the-art baselines in fake news detection.

如今，随着人类活动向社交媒体转移，假新闻检测已经成为一个关键问题。现有的方法忽略了网络新闻的分类差异，不能充分利用多分类知识。例如，在处理“一只老鼠被一只猫吓坏了”的帖子时，学习“计算机”知识的模型往往会误解“老鼠”并给出假标签，而学习“动物”知识的模型往往会给出真标签。因此，本研究提出了一种多分类的分类聚合框架，命名为$CKA$，该框架创新性地在训练阶段学习分类知识，并在预测阶段对分类知识进行聚合。它由三个主要部分组成：新闻特征描述器、集成协调器和真相预测器。新闻特征器负责提取新闻特征，获得新闻分类。集成协调器与新闻特征器合作，在训练阶段生成分类特定模型，以最大限度地保留分类知识，其中每个分类特定模型在相应的新闻分类上最大化假新闻的检测性能。此外，为了在预测阶段对分类知识进行聚合，基于分类特定模型的可靠性评估，真值预测器使用真值发现技术对来自不同分类特定模型的预测进行聚合。大量实验证明，我们提出的$CKA$在假新闻检测方面优于最先进的基线。

{"title":"3A Multi-Classification Division-Aggregation Framework for Fake News Detection","authors":"Wen Zhang;Haitao Fu;Huan Wang;Zhiguo Gong;Pan Zhou;Di Wang","doi":"10.1109/TBDATA.2024.3378098","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3378098","url":null,"abstract":"Nowadays, as human activities are shifting to social media, fake news detection has been a crucial problem. Existing methods ignore the classification difference in online news and cannot take full advantage of multi-classification knowledges. For example, when coping with a post “A mouse is frightened by a cat,” a model that learns “computer” knowledge tends to misunderstand “mouse” and give a fake label, but a model that learns “animal” knowledge tends to give a true label. Therefore, this research proposes a multi-classification division-aggregation framework to detect fake news, named <inline-formula><tex-math>$CKA$</tex-math></inline-formula>, which innovatively learns classification knowledges during training stages and aggregates them during prediction stages. It consists of three main components: a news characterizer, an ensemble coordinator, and a truth predictor. The news characterizer is responsible for extracting news features and obtaining news classifications. Cooperating with the news characterizer, the ensemble coordinator generates classification-specifical models for the maximum reservation of classification knowledges during the training stage, where each classification-specifical model maximizes the detection performance of fake news on corresponding news classifications. Further, to aggregate the classification knowledges during the prediction stage, the truth predictor uses the truth discovery technology to aggregate the predictions from different classification-specifical models based on reliability evaluation of classification-specifical models. Extensive experiments prove that our proposed <inline-formula><tex-math>$CKA$</tex-math></inline-formula> outperforms state-of-the-art baselines in fake news detection.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"130-140"},"PeriodicalIF":7.5,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Heterogeneous Social Event Detection via Hyperbolic Graph Representations 基于双曲图表示的异质社会事件检测

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-03-22 DOI: 10.1109/TBDATA.2024.3381017

Zitai Qiu;Jia Wu;Jian Yang;Xing Su;Charu Aggarwal

Social events reflect the dynamics of society and, here, natural disasters and emergencies receive significant attention. The timely detection of these events can provide organisations and individuals with valuable information to reduce or avoid losses. However, due to the complex heterogeneities of the content and structure of social media, existing models can only learn limited information; large amounts of semantic and structural information are ignored. In addition, due to high labour costs, it is rare for social media datasets to include high-quality labels, which also makes it challenging for models to learn information from social media. In this study, we propose two hyperbolic graph representation-based methods for detecting social events from heterogeneous social media environments. For cases where a dataset has labels, we design a Hyperbolic Social Event Detection (HSED) model that converts complex social information into a unified social message graph. This model addresses the heterogeneity of social media, and, with this graph, the information in social media can be used to capture structural information based on the properties of hyperbolic space. For cases where the dataset is unlabelled, we design an Unsupervised Hyperbolic Social Event Detection (UHSED). This model is based on the HSED model but includes graph contrastive learning to make it work in unlabelled scenarios. Extensive experiments demonstrate the superiority of the proposed approaches.

社会事件反映了社会的动态，在这里，自然灾害和紧急情况受到极大的关注。及时发现这些事件可以为组织和个人提供有价值的信息，以减少或避免损失。然而，由于社交媒体的内容和结构具有复杂的异质性，现有模型只能学习有限的信息；大量的语义和结构信息被忽略了。此外，由于人工成本高，社交媒体数据集很少包含高质量的标签，这也给模型从社交媒体中学习信息带来了挑战。在这项研究中，我们提出了两种基于双曲图表示的方法来检测异构社交媒体环境中的社会事件。对于数据集有标签的情况，我们设计了一个双曲社会事件检测（HSED）模型，该模型将复杂的社会信息转换为统一的社会消息图。该模型解决了社交媒体的异质性问题，通过该图，社交媒体中的信息可以基于双曲空间的属性来捕获结构信息。对于数据集未标记的情况，我们设计了一种无监督双曲社会事件检测（UHSED）。该模型基于HSED模型，但包括图对比学习，使其在未标记的场景下工作。大量的实验证明了所提方法的优越性。

{"title":"Heterogeneous Social Event Detection via Hyperbolic Graph Representations","authors":"Zitai Qiu;Jia Wu;Jian Yang;Xing Su;Charu Aggarwal","doi":"10.1109/TBDATA.2024.3381017","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3381017","url":null,"abstract":"Social events reflect the dynamics of society and, here, natural disasters and emergencies receive significant attention. The timely detection of these events can provide organisations and individuals with valuable information to reduce or avoid losses. However, due to the complex heterogeneities of the content and structure of social media, existing models can only learn limited information; large amounts of semantic and structural information are ignored. In addition, due to high labour costs, it is rare for social media datasets to include high-quality labels, which also makes it challenging for models to learn information from social media. In this study, we propose two hyperbolic graph representation-based methods for detecting social events from heterogeneous social media environments. For cases where a dataset has labels, we design a <bold>Hyperbolic <bold>Social <bold>Event <bold>Detection (HSED) model that converts complex social information into a unified social message graph. This model addresses the heterogeneity of social media, and, with this graph, the information in social media can be used to capture structural information based on the properties of hyperbolic space. For cases where the dataset is unlabelled, we design an <bold>Unsupervised <bold>Hyperbolic <bold>Social <bold>Event <bold>Detection (UHSED). This model is based on the HSED model but includes graph contrastive learning to make it work in unlabelled scenarios. Extensive experiments demonstrate the superiority of the proposed approaches.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"115-129"},"PeriodicalIF":7.5,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

TAMT: Privacy-Preserving Task Assignment With Multi-Threshold Range Search for Spatial Crowdsourcing Applications 空间众包应用中具有多阈值范围搜索的隐私保护任务分配

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-03-20 DOI: 10.1109/TBDATA.2024.3403374

Haiyong Bao;Zhehong Wang;Rongxing Lu;Cheng Huang;Beibei Li

Spatial crowdsourcing is a distributed computing paradigm that utilizes the collective intelligence of workers to perform complex tasks. How to achieve privacy-preserving task assignment in spatial crowdsourcing applications has been a popular research area. However, most of the existing task assignment schemes may reveal private and sensitive information of tasks or workers. Few schemes can support task assignment based on different attributes simultaneously, such as spatial, interest, etc. To study the above themes, in this paper, we propose one privacy-preserving task assignment scheme with multi-threshold range search for spatial crowdsourcing applications (TAMT). Specifically, we first define euclidean distance-based location search and Hamming distance-based interest search, which map the demands of the tasks and the interests of the workers into the binary vectors. Second, we deploy PKD-tree to index the task data leveraging the pivoting techniques and the triangular inequality of euclidean distance, and propose an efficient multi-threshold range search algorithm based on matrix encryption and decomposition technology. Furthermore, based on DT-PKC, we introduce a ciphertext-based secure comparison protocol to support multi-threshold range search for spatial crowdsourcing applications. Finally, comprehensive security analysis proves that our proposed TAMT is privacy-preserving. Meanwhile, theoretical analysis and experimental evaluation demonstrate that TAMT is practical and efficient.

空间众包是一种分布式计算范例，它利用工人的集体智慧来执行复杂的任务。如何在空间众包应用中实现保护隐私的任务分配一直是研究的热点。然而，大多数现有的任务分配方案可能会泄露任务或工人的隐私和敏感信息。很少有方案能够同时支持基于不同属性的任务分配，如空间、兴趣等。为了研究上述主题，本文提出了一种基于多阈值范围搜索的空间众包应用隐私保护任务分配方案。具体来说，我们首先定义了基于欧几里得距离的位置搜索和基于汉明距离的兴趣搜索，它们将任务的需求和工人的兴趣映射到二值向量中。其次，利用旋转技术和欧氏距离的三角不等式，利用pkd树对任务数据进行索引，提出了一种基于矩阵加密和分解技术的高效多阈值范围搜索算法。此外，在DT-PKC的基础上，我们引入了一种基于密文的安全比较协议，支持空间众包应用的多阈值范围搜索。最后，综合安全性分析证明了我们提出的TAMT是隐私保护的。同时，理论分析和实验评价证明了该方法的实用性和有效性。

{"title":"TAMT: Privacy-Preserving Task Assignment With Multi-Threshold Range Search for Spatial Crowdsourcing Applications","authors":"Haiyong Bao;Zhehong Wang;Rongxing Lu;Cheng Huang;Beibei Li","doi":"10.1109/TBDATA.2024.3403374","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3403374","url":null,"abstract":"Spatial crowdsourcing is a distributed computing paradigm that utilizes the collective intelligence of workers to perform complex tasks. How to achieve privacy-preserving task assignment in spatial crowdsourcing applications has been a popular research area. However, most of the existing task assignment schemes may reveal private and sensitive information of tasks or workers. Few schemes can support task assignment based on different attributes simultaneously, such as spatial, interest, etc. To study the above themes, in this paper, we propose one privacy-preserving task assignment scheme with multi-threshold range search for spatial crowdsourcing applications (TAMT). Specifically, we first define euclidean distance-based location search and Hamming distance-based interest search, which map the demands of the tasks and the interests of the workers into the binary vectors. Second, we deploy PKD-tree to index the task data leveraging the pivoting techniques and the triangular inequality of euclidean distance, and propose an efficient multi-threshold range search algorithm based on matrix encryption and decomposition technology. Furthermore, based on DT-PKC, we introduce a ciphertext-based secure comparison protocol to support multi-threshold range search for spatial crowdsourcing applications. Finally, comprehensive security analysis proves that our proposed TAMT is privacy-preserving. Meanwhile, theoretical analysis and experimental evaluation demonstrate that TAMT is practical and efficient.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"208-220"},"PeriodicalIF":7.5,"publicationDate":"2024-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A Universal and Efficient Multi-Modal Smart Contract Vulnerability Detection Framework for Big Data 面向大数据的通用高效多模态智能合约漏洞检测框架

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-03-20 DOI: 10.1109/TBDATA.2024.3403376

Wenjuan Lian;Zikang Bao;Xinze Zhang;Bin Jia;Yang Zhang

A vulnerability or error in a smart contract will lead to serious consequences including loss of assets and leakage of user privacy. Established smart contract vulnerability detection tools define vulnerabilities through symbolic execution, fuzz testing, and other methods requiring extremely specialized security knowledge. Even so, with the development of vulnerability exploitation techniques, vulnerability detection tools customized by experts cannot cope with the deformation of existing vulnerabilities or unknown vulnerabilities. The vulnerability detection based on machine learning developed in recent years studies vulnerabilities from different dimensions and designs corresponding models to achieve a high detection rate. However, these methods usually only focus on some features of smart contracts, or the model itself does not have universality. Experimental results on the publicly large-scale dataset SmartBugs-Wild demonstrate that this paper's method not only outperforms existing methods in several metrics, but also is scalable, general, and requires less domain knowledge, providing a new idea for the development of smart contract vulnerability detection.

智能合约中的漏洞或错误将导致严重的后果，包括资产损失和用户隐私泄露。现有的智能合约漏洞检测工具通过符号执行、模糊测试和其他需要极其专业的安全知识的方法来定义漏洞。尽管如此，随着漏洞利用技术的发展，专家定制的漏洞检测工具无法应对现有漏洞或未知漏洞的变形。近年来发展起来的基于机器学习的漏洞检测从不同的维度对漏洞进行研究，并设计相应的模型，以达到较高的检测率。然而，这些方法通常只关注智能合约的某些特征，或者模型本身不具有通用性。在公开大规模数据集SmartBugs-Wild上的实验结果表明，本文方法不仅在多个指标上优于现有方法，而且具有可扩展性、通用性和较少的领域知识要求，为智能合约漏洞检测的发展提供了新的思路。

{"title":"A Universal and Efficient Multi-Modal Smart Contract Vulnerability Detection Framework for Big Data","authors":"Wenjuan Lian;Zikang Bao;Xinze Zhang;Bin Jia;Yang Zhang","doi":"10.1109/TBDATA.2024.3403376","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3403376","url":null,"abstract":"A vulnerability or error in a smart contract will lead to serious consequences including loss of assets and leakage of user privacy. Established smart contract vulnerability detection tools define vulnerabilities through symbolic execution, fuzz testing, and other methods requiring extremely specialized security knowledge. Even so, with the development of vulnerability exploitation techniques, vulnerability detection tools customized by experts cannot cope with the deformation of existing vulnerabilities or unknown vulnerabilities. The vulnerability detection based on machine learning developed in recent years studies vulnerabilities from different dimensions and designs corresponding models to achieve a high detection rate. However, these methods usually only focus on some features of smart contracts, or the model itself does not have universality. Experimental results on the publicly large-scale dataset SmartBugs-Wild demonstrate that this paper's method not only outperforms existing methods in several metrics, but also is scalable, general, and requires less domain knowledge, providing a new idea for the development of smart contract vulnerability detection.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"190-207"},"PeriodicalIF":7.5,"publicationDate":"2024-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Towards Real-Time Network Intrusion Detection With Image-Based Sequential Packets Representation 基于图像序列数据包表示的实时网络入侵检测

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-03-20 DOI: 10.1109/TBDATA.2024.3403394

Jalal Ghadermazi;Ankit Shah;Nathaniel D. Bastian

Machine learning (ML) and deep learning (DL) advancements have greatly enhanced anomaly detection of network intrusion detection systems (NIDS) by empowering them to analyze Big Data and extract patterns. ML/DL-based NIDS are trained using either flow-based or packet-based features. Flow-based NIDS are suitable for offline traffic analysis, while packet-based NIDS can analyze traffic and detect attacks in real-time. Current packet-based approaches analyze packets independently, overlooking the sequential nature of network communication. This results in biased models that exhibit increased false negatives and positives. Additionally, most literature-proposed packet-based NIDS capture only payload data, neglecting crucial information from packet headers. This oversight can impair the ability to identify header-level attacks, such as denial-of-service attacks. To address these limitations, we propose a novel artificial intelligence-enabled methodological framework for packet-based NIDS that effectively analyzes header and payload data and considers temporal connections among packets. Our framework transforms sequential packets into two-dimensional images. It then develops a convolutional neural network-based intrusion detection model to process these images and detect malicious activities. Through experiments using publicly available big datasets, we demonstrate that our framework is able to achieve high detection rates of 97.7% to 99% across different attack types and displays promising resilience against adversarial examples.

机器学习（ML）和深度学习（DL）的进步极大地增强了网络入侵检测系统（NIDS）的异常检测能力，使它们能够分析大数据并提取模式。基于ML/ dl的NIDS使用基于流或基于包的特征进行训练。基于流量的入侵检测适合于离线的流量分析，而基于报文的入侵检测可以实时分析流量和检测攻击。当前基于数据包的方法独立分析数据包，忽略了网络通信的顺序性。这导致有偏见的模型，表现出更多的假阴性和假阳性。此外，大多数文献提出的基于包的NIDS仅捕获有效负载数据，而忽略了包头中的关键信息。这种疏忽会损害识别报头级攻击的能力，例如拒绝服务攻击。为了解决这些限制，我们为基于数据包的NIDS提出了一种新的支持人工智能的方法框架，该框架可以有效地分析报头和有效载荷数据，并考虑数据包之间的时间连接。我们的框架将顺序数据包转换为二维图像。然后开发了一个基于卷积神经网络的入侵检测模型来处理这些图像并检测恶意活动。通过使用公开可用的大数据集的实验，我们证明了我们的框架能够在不同的攻击类型中实现97.7%到99%的高检测率，并且对对抗性示例显示出有希望的弹性。

{"title":"Towards Real-Time Network Intrusion Detection With Image-Based Sequential Packets Representation","authors":"Jalal Ghadermazi;Ankit Shah;Nathaniel D. Bastian","doi":"10.1109/TBDATA.2024.3403394","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3403394","url":null,"abstract":"Machine learning (ML) and deep learning (DL) advancements have greatly enhanced anomaly detection of network intrusion detection systems (NIDS) by empowering them to analyze Big Data and extract patterns. ML/DL-based NIDS are trained using either flow-based or packet-based features. Flow-based NIDS are suitable for offline traffic analysis, while packet-based NIDS can analyze traffic and detect attacks in real-time. Current packet-based approaches analyze packets independently, overlooking the sequential nature of network communication. This results in biased models that exhibit increased false negatives and positives. Additionally, most literature-proposed packet-based NIDS capture only payload data, neglecting crucial information from packet headers. This oversight can impair the ability to identify header-level attacks, such as denial-of-service attacks. To address these limitations, we propose a novel artificial intelligence-enabled methodological framework for packet-based NIDS that effectively analyzes header and payload data and considers temporal connections among packets. Our framework transforms sequential packets into two-dimensional images. It then develops a convolutional neural network-based intrusion detection model to process these images and detect malicious activities. Through experiments using publicly available big datasets, we demonstrate that our framework is able to achieve high detection rates of 97.7% to 99% across different attack types and displays promising resilience against adversarial examples.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"157-173"},"PeriodicalIF":7.5,"publicationDate":"2024-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10535236","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0