Pub Date: 2025-01-27 | DOI: 10.1109/TBDATA.2025.3533898
Sheng Xiang;Chenhao Xu;Dawei Cheng;Ying Zhang
Graph generation plays an essential role in understanding the formation of complex network structures across various fields, such as biological and social networks. Recent studies have shifted towards employing deep learning methods to grasp the topology of graphs. Yet, most current graph generators fail to adequately capture the community structure, which stands out as a critical and distinctive aspect of graphs. Additionally, these generators are generally limited to smaller graphs because of their inefficiencies and scaling challenges. This paper introduces the Community-Preserving Graph Adversarial Network (CPGAN), designed to effectively simulate graphs. CPGAN leverages graph convolution networks within its encoder and maintains shared parameters during generation to encapsulate community structure information and ensure permutation invariance. We also present the Scalable Community-Preserving Graph Attention Network (SCPGAN), aimed at enhancing the scalability of our model. SCPGAN considerably cuts down inference and training time, as well as GPU memory usage, through an ego-graph sampling approach and a short-pipeline autoencoder framework. Tests conducted on six real-world graph datasets reveal that CPGAN strikes a favorable balance between efficiency and simulation quality compared to leading-edge baselines. Moreover, SCPGAN makes substantial strides in model efficiency and scalability, successfully scaling generated graphs to the 10-million-node level while maintaining quality competitive with other advanced learning models.
Title: Scalable Learning-Based Community-Preserving Graph Generation. IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2457-2470.
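The ego-graph sampling idea mentioned in the abstract can be illustrated with a small sketch: a bounded breadth-first traversal from a center node. The `fanout` cap, seed, and function name below are illustrative assumptions, not the authors' SCPGAN implementation.

```python
import random
from collections import deque

def sample_ego_graph(adj, center, hops=2, fanout=3, seed=0):
    """Sample a bounded ego-graph: BFS from `center`, keeping at most
    `fanout` randomly chosen neighbors per node, up to `hops` hops."""
    rng = random.Random(seed)
    visited = {center}
    edges = []
    frontier = deque([(center, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:          # stop expanding past the hop limit
            continue
        nbrs = list(adj.get(node, []))
        rng.shuffle(nbrs)
        for nbr in nbrs[:fanout]:  # per-node fanout cap bounds memory
            edges.append((node, nbr))
            if nbr not in visited:
                visited.add(nbr)
                frontier.append((nbr, depth + 1))
    return visited, edges
```

Capping both depth and fanout is what keeps per-sample cost independent of total graph size, which is the property that enables scaling to very large graphs.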
Pub Date: 2025-01-27 | DOI: 10.1109/TBDATA.2025.3533924
Liang Zhang;Xingyu Wu;Yuhang Ma;Haibin Kan
As a global virtual environment, the metaverse poses various challenges regarding data storage, sharing, interoperability, and privacy preservation. Typically, a trusted third party (TTP) is considered necessary in these scenarios. However, relying on a single TTP may introduce biases, compromise privacy, or create a single point of failure. To address these challenges and enable secure data exchange in the metaverse, we propose a system based on decentralized TTPs and the Ethereum blockchain. First, we use the threshold ElGamal cryptosystem to create the decentralized TTPs, employing verifiable secret sharing (VSS) to force owners to share data honestly. Second, we leverage the Ethereum blockchain as the public communication channel, automatic verification machine, and smart contract engine. Third, we apply discrete logarithm equality (DLEQ) algorithms to generate non-interactive zero-knowledge (NIZK) proofs when encrypted data is uploaded to the blockchain. Fourth, we present an incentive mechanism that rewards data owners and TTPs for data-sharing activities, as well as a penalty policy applied when malicious behavior is detected. Consequently, we construct a data exchange framework for the metaverse in which all involved entities are accountable. Finally, we perform comprehensive experiments to demonstrate the feasibility and analyze the properties of the proposed system.
Title: Data Exchange for the Metaverse With Accountable Decentralized TTPs and Incentive Mechanisms. IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2431-2442.
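As a rough illustration of the DLEQ-based NIZK proofs the abstract mentions, here is a textbook Chaum-Pedersen-style proof of discrete-log equality, made non-interactive via a Fiat-Shamir hash. The tiny group (P=23, Q=11) and fixed nonce are for demonstration only, not the system's actual cryptographic setup.

```python
import hashlib

P, Q = 23, 11  # toy group: a subgroup of order Q=11 inside Z_23* (demo only)

def H(*vals):
    """Fiat-Shamir challenge: hash the transcript into Z_Q."""
    data = b"|".join(str(v).encode() for v in vals)
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q

def dleq_prove(x, g1, g2, w=7):
    """Prove log_g1(h1) == log_g2(h2) == x without revealing x."""
    h1, h2 = pow(g1, x, P), pow(g2, x, P)
    a1, a2 = pow(g1, w, P), pow(g2, w, P)   # commitments with nonce w
    c = H(g1, h1, g2, h2, a1, a2)           # challenge
    r = (w - c * x) % Q                     # response
    return h1, h2, a1, a2, c, r

def dleq_verify(g1, g2, h1, h2, a1, a2, c, r):
    if c != H(g1, h1, g2, h2, a1, a2):
        return False
    # both commitments must reopen: a_i == g_i^r * h_i^c  (since r + c*x == w mod Q)
    ok1 = a1 == (pow(g1, r, P) * pow(h1, c, P)) % P
    ok2 = a2 == (pow(g2, r, P) * pow(h2, c, P)) % P
    return ok1 and ok2
```

In a deployment the same check can be expressed in a smart contract, which is what lets the blockchain act as the "automatic verification machine" the abstract describes.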
Pub Date: 2025-01-27 | DOI: 10.1109/TBDATA.2025.3533908
Simon Nandwa Anjiri;Derui Ding;Yan Song;Ying Sun
Within the scope of location-based services and personalized recommendations, the challenge of recommending new and unvisited points of interest (POIs) to mobile users is compounded by the sparsity of check-in data. Traditional recommendation models often overlook user and POI attributes, which exacerbates data sparsity and cold-start problems. To address this issue, a novel multiplex hypergraph attribute-based graph collaborative filtering method is proposed for POI recommendation, creating a robust recommendation system capable of handling sparse data and cold-start scenarios. Specifically, a multiplex network hypergraph is first constructed to capture complex relationships between users, POIs, and attributes based on the similarities of attributes, visit frequencies, and preferences. Then, an adaptive variational graph auto-encoder adversarial network is developed to accurately infer the users’/POIs’ preference embeddings from their attribute distributions, which reflect complex attribute dependencies and latent structures within the data. Moreover, a dual graph neural network variant based on both GraphSAGE K-nearest-neighbor networks and gated recurrent units is created to effectively capture attributes of different modalities in a neighborhood, including temporal dependencies in user preferences and spatial attributes of POIs. Finally, experiments conducted on Foursquare and Yelp datasets reveal the superiority and robustness of the developed model compared to typical state-of-the-art approaches and illustrate its effectiveness on the cold-start user and POI problems.
Title: A Multiplex Hypergraph Attribute-Based Graph Collaborative Filtering for Cold-Start POI Recommendation. IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2401-2416.
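The GraphSAGE-style neighborhood aggregation the abstract builds on can be sketched minimally: each node's representation is updated from its own vector and its neighbors' vectors. This mean aggregator is a generic illustration, not the paper's dual-network model.

```python
def mean_aggregate(features, neighbors):
    """One GraphSAGE-style step: each node's new vector is the average of
    its own vector and its neighbors' vectors (a minimal mean aggregator)."""
    out = {}
    for v, nbrs in neighbors.items():
        vecs = [features[v]] + [features[u] for u in nbrs]
        dim = len(features[v])
        out[v] = [sum(vec[i] for vec in vecs) / len(vecs) for i in range(dim)]
    return out
```

Because a cold-start user or POI still has attribute-based neighbors in the hypergraph, this kind of neighborhood averaging is what lets the model produce embeddings for nodes with few or no check-ins.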
Pub Date: 2025-01-27 | DOI: 10.1109/TBDATA.2025.3533891
Ezekiel B. Ouedraogo;Ammar Hawbani;Xingfu Wang;Zhi Liu;Liang Zhao;Mohammed A. A. Al-qaness;Saeed Hamood Alsamhi
Digital Twins are virtual representations of physical assets and systems that rely on effective Data Management to integrate, process, and analyze diverse data sources. This article comprehensively examines Data Management challenges, architectures, techniques, and applications in the context of Digital Twins. It explores key issues such as data heterogeneity, quality assurance, scalability, security, and interoperability. The paper outlines architectural approaches like centralized, distributed, cloud-based, and blockchain solutions and Data Management techniques for modeling, integration, fusion, quality management, and visualization. Domain-specific considerations across manufacturing, smart cities, healthcare, and other sectors are discussed. Finally, open research challenges related to standards, real-time data processing, intelligent Data Management, and ethical aspects are highlighted. By synthesizing the state-of-the-art, this review serves as a valuable reference for developing robust Data Management strategies that enable Digital Twin deployments.
Title: Digital Twin Data Management: A Comprehensive Review. IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2224-2243.
Pub Date: 2025-01-15 | DOI: 10.1109/TBDATA.2025.3526356
Title: 2024 Reviewers List. IEEE Transactions on Big Data, vol. 11, no. 1, pp. 310-313.
Pub Date: 2025-01-14 | DOI: 10.1109/TBDATA.2025.3528727
Jing Wang;Dehui Kong;Baocai Yin
Weakly supervised temporal action localization (WTAL) aims to precisely locate action instances in given videos using only video-level classification supervision, which is partly related to action classification. Most existing localization works directly utilize feature encoders pre-trained for video classification tasks to extract video features, resulting in non-targeted features that lead to incomplete or over-complete action localization. Therefore, we propose the Generalized Contrastive Learning Network (GCLNet), in which two novel strategies improve the pre-trained features. First, to address over-completeness, GCLNet introduces text information with good context independence and category separability to enrich the expression of video features, and proposes a novel generalized contrastive learning approach for similarity metrics, which pulls features of the same category closer together while pushing those of different categories farther apart. Consequently, it enables more compact intra-class feature learning and ensures accurate action localization. Second, to tackle incompleteness, we exploit the respective advantages of RGB and Flow features in scene appearance and temporal motion expression, designing a hybrid attention strategy in GCLNet so that the features of each channel enhance one another. This process greatly improves the features by establishing cross-channel consensus. Finally, we conduct extensive experiments on THUMOS14 and ActivityNet1.2, and the results show that our proposed GCLNet produces more representative action localization features.
Title: GCLNet: Generalized Contrastive Learning for Weakly Supervised Temporal Action Localization. IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2365-2375.
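The pull-together/push-apart objective described above resembles a supervised contrastive loss. The sketch below is a generic formulation under that assumption (cosine similarity, temperature `tau`), not GCLNet's exact loss.

```python
import math

def contrastive_loss(embs, labels, tau=0.5):
    """Supervised contrastive loss sketch: for each anchor, same-label
    embeddings are positives (pulled closer) and the rest are negatives
    (pushed apart). Lower loss means tighter intra-class clusters."""
    def dot(a, b): return sum(x * y for x, y in zip(a, b))
    def sim(a, b): return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
    total, count = 0.0, 0
    n = len(embs)
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue
        denom = sum(math.exp(sim(embs[i], embs[j]) / tau)
                    for j in range(n) if j != i)
        for j in pos:  # -log softmax probability of each positive pair
            total += -math.log(math.exp(sim(embs[i], embs[j]) / tau) / denom)
            count += 1
    return total / count
```

Well-separated classes yield a smaller loss than mixed ones, which is exactly the gradient signal that compacts intra-class features.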
Knowledge combination prediction involves analyzing current knowledge elements and their relationships, then forecasting how these elements, drawn from various fields, can be creatively combined to form new, innovative solutions. This process is critical for countries and businesses to understand future technology trends and promote innovation in an era of rapid scientific and technological advancement. Existing methods often overlook the integration of knowledge combinations from multiple views, along with their inherent heterophily and the dual “many-to-one” property, where a single knowledge combination can include multiple elements, and a single element may belong to various combinations. To this end, we propose a novel framework named Multi-view Heterogeneous HyperGNN for Heterophilic Knowledge Combination Prediction (H3KCP). Specifically, H3KCP first constructs a hypergraph reflecting the dual “many-to-one” property of knowledge combinations, where each hyperedge may contain several nodes and each node can also belong to multiple hyperedges. Next, the framework employs a multi-view fusion approach to model knowledge combinations, considering heterophily and integrating insights from co-occurrence, co-citation, and hierarchical structure-based views. Furthermore, our analysis of H3KCP from a spectral graph perspective offers insights into its rationality. Finally, extensive experiments on real-world patent datasets and the Open Academic Graph dataset validate the effectiveness and efficiency of our approach, yielding significant insights into knowledge combinations.
Title: Multi-View Heterogeneous HyperGNN for Heterophilic Knowledge Combination Prediction. Authors: Huijie Liu;Shulan Ruan;Han Wu;Zhenya Huang;Defu Lian;Qi Liu;Enhong Chen. Pub Date: 2025-01-08 | DOI: 10.1109/TBDATA.2025.3527216. IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2321-2337.
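The dual "many-to-one" property can be made concrete with a node-by-hyperedge incidence matrix, in which a row with several 1s is a node in multiple combinations and a column with several 1s is a combination with multiple elements. This is a generic sketch, not the H3KCP construction.

```python
def incidence(num_nodes, hyperedges):
    """Build the node-by-hyperedge incidence matrix H, where H[v][e] = 1
    iff node v belongs to hyperedge e. node_deg counts the hyperedges a
    node joins; edge_deg counts the nodes a hyperedge holds."""
    H = [[0] * len(hyperedges) for _ in range(num_nodes)]
    for e, members in enumerate(hyperedges):
        for v in members:
            H[v][e] = 1
    node_deg = [sum(row) for row in H]
    edge_deg = [len(members) for members in hyperedges]
    return H, node_deg, edge_deg
```

Spectral hypergraph convolutions are typically defined in terms of this H together with the two degree vectors, which is why the incidence structure is the natural starting point for the paper's spectral analysis.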
In the field of heterogeneous federated learning (FL), the key challenge is to efficiently and collaboratively train models across multiple clients with different data distributions, model structures, task objectives, computational capabilities, and communication resources. This diversity leads to significant heterogeneity, which increases the complexity of model training. In this paper, we first outline the basic concepts of heterogeneous FL and summarize the research challenges in FL in terms of five aspects: data, model, task, device and communication. In addition, we explore how existing state-of-the-art approaches cope with the heterogeneity of FL, and categorize and review these approaches at three different levels: data-level, model-level, and architecture-level. Subsequently, the paper extensively discusses privacy-preserving strategies in heterogeneous FL environments. Finally, the paper discusses current open issues and directions for future research, aiming to promote the further development of heterogeneous FL.
Title: Advances in Robust Federated Learning: A Survey With Heterogeneity Considerations. Authors: Chuan Chen;Tianchi Liao;Xiaojun Deng;Zihou Wu;Sheng Huang;Zibin Zheng. Pub Date: 2025-01-08 | DOI: 10.1109/TBDATA.2025.3527202. IEEE Transactions on Big Data, vol. 11, no. 3, pp. 1548-1567.
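Much of the heterogeneity-handling work the survey reviews extends the basic FedAvg aggregation step, in which a server averages client models weighted by local dataset size. A minimal sketch of that baseline (not any specific surveyed method) is:

```python
def fedavg(client_weights, client_sizes):
    """FedAvg-style aggregation: average client parameter vectors,
    weighted by each client's local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(dim)]
```

Data heterogeneity shows up exactly here: when client distributions differ, the weighted average can drift from any single client's optimum, which motivates the data-, model-, and architecture-level corrections the survey categorizes.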
Pub Date: 2025-01-08 | DOI: 10.1109/TBDATA.2025.3527230
Junwei Yin;Min Gao;Kai Shu;Zehua Zhao;Yinqiu Huang;Jia Wang
The wide dissemination of fake news has affected our lives in many aspects, making fake news detection important and attracting increasing attention. Existing approaches make substantial contributions to this field by modeling news from a single-modal or multi-modal perspective. However, these modal-based methods can yield sub-optimal outcomes because they ignore reader behaviors in news consumption and authenticity verification. For instance, they do not take into consideration the component-by-component reading process: from the headline, images, and comments to the body, which is essential for modeling news at finer granularity. To this end, we propose an approach that Emulates the behaviors of readers (Ember) for fake news detection on social media, incorporating readers’ reading and verification process to thoroughly model news from the component perspective. Specifically, we first construct intra-component feature extractors to emulate semantic analysis of each component. Then, we design a module that comprises inter-component feature extractors and a sequence-based aggregator. This module mimics the process of verifying the correlation between components and the overall reading and verification sequence. Thus, Ember can handle news with various components by emulating the corresponding sequences. We conduct extensive experiments on nine real-world datasets, and the results demonstrate the superiority of Ember.
Title: Emulating Reader Behaviors for Fake News Detection. IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2353-2364.
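The sequence-based aggregator described above can be sketched generically as attention-weighted pooling over per-component embeddings in reading order (headline, images, comments, body). The component names and score inputs below are illustrative assumptions, not Ember's actual module.

```python
import math

def aggregate_components(components, scores):
    """Attention-pooling sketch: softmax the per-component relevance
    scores, then take the weighted sum of component embeddings to get
    a single news-level vector."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(components[0])
    return [sum(w * c[i] for w, c in zip(weights, components))
            for i in range(dim)]
```

With equal scores this reduces to a plain mean; learned scores let the model emphasize whichever component (e.g. comments vs. body) is most diagnostic for a given article.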
Pub Date: 2025-01-08 | DOI: 10.1109/TBDATA.2025.3527200
Zuodong Jin;Peng Qi;Muyan Yao;Dan Tao
With the widespread application of Big Data and intelligent information systems, the tenant has become the primary service unit in many scenarios. As a data mining technique, the portrait has been widely used to provide targeted services. Therefore, we transfer the traditional user-driven portrait to a tenant-driven one for churn prediction. To achieve this, the paper first proposes a three-layer architecture and defines fine-grained features for creating portraits from the perspective of tenants. On a large-scale telecommunication-industry dataset of 100,000 tenants, we construct the tenant portrait through the proposed framework and analyze the influence of the defined features on churn probability. Then, considering information missing due to privacy concerns, we propose CrossMatch, a portrait completion model based on semi-supervised learning and graph convolution, which exploits relational characteristics among tenants to recover missing information. On this basis, we design a tenant churn prediction method based on a directed attention network. Moreover, we recover missing information on three public node datasets with CrossMatch, achieving around a 1-2% improvement. We then apply the directed attention network to churn prediction and achieve an Accuracy of 75.06%, Precision of 77.78%, and F1-score of 71.43%, outperforming all baselines.
Title: Portraying Fine-Grained Tenant Portrait for Churn Prediction Using Semi-Supervised Graph Convolution and Attention Network. IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2296-2307.
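The reported Accuracy, Precision, and F1-score follow from the standard confusion-matrix definitions; a minimal helper using those textbook formulas (not the authors' evaluation code) is:

```python
def binary_metrics(y_true, y_pred):
    """Standard binary-classification metrics from confusion-matrix
    counts: accuracy, precision = TP/(TP+FP), F1 = harmonic mean of
    precision and recall."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, f1
```

Reporting precision alongside F1 matters for churn, where classes are typically imbalanced and plain accuracy can overstate performance.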