Pub Date: 2023-10-16; DOI: 10.1109/TBDATA.2023.3325045
Jinghuan Lao;Dong Huang;Chang-Dong Wang;Jian-Huang Lai
This paper addresses two limitations of previous multi-view clustering approaches. First, they frequently suffer from quadratic or cubic computational complexity, which restricts their feasibility for large-scale datasets. Second, they often rely on a single graph per view and cannot jointly explore many versatile graph structures to better exploit multi-view information. To address both, this paper presents Scalable Multi-view Clustering via Many Bipartite graphs (SMCMB), an approach that jointly learns and fuses many bipartite graphs from multiple views while remaining efficient on very large-scale datasets. Unlike the one-anchor-set-per-view paradigm, SMCMB first produces multiple diversified anchor sets on each view, and thus many anchor sets across views; on these, anchor-based subspace representation learning is enforced and many bipartite graphs are learned simultaneously. These bipartite graphs are then efficiently partitioned to produce base clusterings, which are re-formulated into a unified bipartite graph for the final clustering. SMCMB has almost linear time and space complexity. Extensive experiments on twenty general-scale and large-scale multi-view datasets confirm its superiority in scalability and robustness over the state of the art.
Citation: "Towards Scalable Multi-View Clustering via Joint Learning of Many Bipartite Graphs", IEEE Transactions on Big Data, vol. 10, no. 1, pp. 77-91.
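To make the "many anchor sets, many bipartite graphs" idea concrete, here is a minimal Python sketch. It is not the SMCMB algorithm: the abstract learns its graphs through anchor-based subspace representation learning, whereas this sketch substitutes random anchor sampling and a simple k-nearest-anchor rule with inverse-distance weights. The function names `sample_anchor_sets` and `bipartite_graph` and the parameter `k` are hypothetical.

```python
import math
import random


def sample_anchor_sets(points, num_sets, anchors_per_set, seed=0):
    """Draw several anchor sets from one view by random sampling.

    Assumption: random sampling stands in for the paper's diversified
    anchor selection, which the abstract does not specify.
    """
    rng = random.Random(seed)
    return [rng.sample(points, anchors_per_set) for _ in range(num_sets)]


def bipartite_graph(points, anchors, k=2):
    """Return an n x m weight matrix linking each sample to its k nearest anchors.

    Assumption: inverse-distance weights replace the learned subspace
    representation. Each row is normalized to sum to 1.
    """
    graph = []
    for p in points:
        dists = [math.dist(p, a) for a in anchors]
        nearest = sorted(range(len(anchors)), key=dists.__getitem__)[:k]
        weights = [0.0] * len(anchors)
        for j in nearest:
            weights[j] = 1.0 / (1.0 + dists[j])
        total = sum(weights)
        graph.append([w / total for w in weights])
    return graph
```

Each graph stores n × m entries for n samples and m anchors, so with a fixed number of anchors the storage of this sketch grows linearly in n, which is the property that makes anchor-based bipartite graphs attractive for large datasets.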
Pub Date: 2023-10-13; DOI: 10.1109/TBDATA.2023.3324482
Zhifei Ding;Jiahao Han;Rongtao Qian;Liming Shen;Siru Chen;Lingxin Yu;Yu Zhu;Richen Liu
We propose eBoF, a novel time-varying ensemble data visualization approach based on the Bag-of-Features (BoF) model. In the eBoF model, we extract a simple and monotone interval from all target variables of ensemble scalar data as a local feature patch. Each local feature of a semantically simple single interval can be defined as a feature patch within the BoF model, with the duration of each interval (i.e., feature patch) serving as its frequency. Feature clusters in ensemble runs are then identified based on the similarity of temporal correlations. eBoF generates clusters along with their probability distributions across all feature patches while preserving the geo-spatial information, which is often lost in traditional topic modeling or clustering algorithms. The probability distribution across different clusters can help to generate reasonable clustering results, evaluated by domain knowledge. We conduct case studies and performance tests to evaluate the eBoF model and gather feedback from domain experts to further refine it. Evaluation results suggest the proposed eBoF can provide insightful and comprehensive evidence on ensemble simulation data analysis.
Citation: "eBoF: Interactive Temporal Correlation Analysis for Ensemble Data Based on Bag-of-Features", IEEE Transactions on Big Data, vol. 9, no. 6, pp. 1726-1737.
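The notion of a monotone interval used as a feature patch, with its duration as the frequency, can be illustrated with a short Python sketch. This is only one plausible reading of the abstract for a single scalar time series. The functions `monotone_intervals` and `feature_histogram`, and the choice to key the histogram by direction (rising, falling, flat), are assumptions and are not taken from the paper.

```python
def monotone_intervals(series):
    """Split a series into maximal monotone runs.

    Returns (start_index, end_index, direction) tuples, where direction is
    +1 (rising), -1 (falling) or 0 (entirely flat). Flat steps are absorbed
    into the run they occur in.
    """
    intervals = []
    start = 0
    direction = 0  # undecided until the first non-flat step
    for i in range(1, len(series)):
        diff = series[i] - series[i - 1]
        step = (diff > 0) - (diff < 0)
        if step != 0 and direction != 0 and step != direction:
            # Trend reversed: close the current run at the turning point.
            intervals.append((start, i - 1, direction))
            start = i - 1
            direction = step
        elif direction == 0:
            direction = step
    if len(series) > 1:
        intervals.append((start, len(series) - 1, direction))
    return intervals


def feature_histogram(intervals):
    """Bag-of-features style histogram: total duration per direction label."""
    names = {1: "rising", -1: "falling", 0: "flat"}
    histogram = {}
    for start, end, direction in intervals:
        label = names[direction]
        histogram[label] = histogram.get(label, 0) + (end - start)
    return histogram
```

For the series `[1, 2, 3, 2, 1, 1, 4]` the sketch yields three runs (rising, falling, rising), and the histogram weights each run by how many time steps it spans.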
Graph-based semi-supervised learning (GSSL) is an important technique because it is effective in practice. Existing GSSL methods often treat all given labels equally and ignore that labels differ in importance. In systems with imperfect annotation, the collected labels often contain noise, and methods that treat labels equally suffer from it. In this article, we propose a novel label-weighted graph-based learning method for semi-supervised classification under label noise, which accounts for differences in how much each label contributes. In particular, the label dependency of the data is revealed by graph constraints. Using this dependency, the proposed method adopts an adaptive label-weighting strategy in which weights are assigned to labels adaptively. An efficient algorithm is developed to solve the resulting optimization objective, in which each subproblem has a closed-form solution. Experimental results on a synthetic dataset and several real-world datasets show the advantage of the proposed method over state-of-the-art methods.
Citation: "Label-Weighted Graph-Based Learning for Semi-Supervised Classification Under Label Noise", IEEE Transactions on Big Data, vol. 10, no. 1, pp. 55-65. Authors: Naiyao Liang; Zuyuan Yang; Junhang Chen; Zhenni Li; Shengli Xie. Pub Date: 2023-09-27; DOI: 10.1109/TBDATA.2023.3319249
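The effect of weighting labels unequally can be seen in a small Python sketch of graph label propagation. It minimizes a generic objective, sum_i w_i ||F_i - Y_i||^2 + (mu/2) sum_ij A_ij ||F_i - F_j||^2, by repeated coordinate updates. The objective, the function `propagate_labels`, and the hand-set weights are all assumptions: the paper learns its label weights adaptively, and that adaptive step is not reproduced here.

```python
def propagate_labels(adjacency, labels, label_weights, num_classes,
                     mu=1.0, iterations=500):
    """Weighted label propagation on a graph (Gauss-Seidel updates).

    adjacency     : symmetric n x n list of lists with a zero diagonal
    labels        : class index per node, or None for unlabeled nodes
    label_weights : trust in each given label (0 for unlabeled nodes);
                    fixed by hand here, not learned as in the paper
    Returns the predicted class index for every node.
    """
    n = len(adjacency)
    scores = [[0.0] * num_classes for _ in range(n)]
    for _ in range(iterations):
        for i in range(n):
            degree = sum(adjacency[i])
            denom = label_weights[i] + mu * degree
            if denom == 0:
                continue  # isolated, unlabeled node: nothing to update
            for c in range(num_classes):
                target = 1.0 if labels[i] == c else 0.0
                smooth = sum(adjacency[i][j] * scores[j][c] for j in range(n))
                # Closed-form minimizer of the objective with respect to F_i[c].
                scores[i][c] = (label_weights[i] * target + mu * smooth) / denom
    return [max(range(num_classes), key=row.__getitem__) for row in scores]
```

On a five-node chain where node 0 is labeled class 0, node 4 is labeled class 1, and node 1 carries a noisy class-1 label, giving the noisy label a weight of 0.1 lets the graph structure override it, while a weight of 1.0 lets the noise through.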
The Gate Recurrent Unit (GRU) has rarely been applied to the legal transition sequences of bounded Petri nets. A GRU-based method is proposed for recognizing the legal transition sequences of a bounded Petri net. First, legal and non-legal transition sequences of a Petri net are generated according to a given noise ratio. Then, these sequences are fed into a GRU, which recognizes the legal ones after the variable-length sequences are encoded to a uniform length based on the maximum sequence length. The proposed method is validated on different Petri nets at different noise ratios and compared with seven widely known baselines. The results show that the proposed method achieves excellent recognition accuracy and robustness in most situations. This addresses the problem that existing methods cannot recognize the legal transition sequences of Petri nets in real time.
Citation: "Legal Transition Sequence Recognition of a Bounded Petri Net Using a Gate Recurrent Unit", IEEE Transactions on Big Data, vol. 10, no. 1, pp. 66-76. Authors: Qingtian Zeng; Shuai Guo; Rui Cao; Ziqi Zhao; Hua Duan. Pub Date: 2023-09-27; DOI: 10.1109/TBDATA.2023.3319252
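For readers unfamiliar with the term, a transition sequence is legal when every transition in it is enabled at the marking reached by firing its predecessors. The Python sketch below checks this with the classical Petri net firing rule. It is not the GRU-based recognizer from the paper; it only shows the ground-truth notion that such a recognizer would be trained to predict. The dictionary-based net encoding and the name `is_legal_sequence` are assumptions.

```python
def is_legal_sequence(initial_marking, pre, post, sequence):
    """Check a transition sequence against the Petri net firing rule.

    initial_marking : dict place -> token count
    pre             : dict transition -> {place: tokens consumed}
    post            : dict transition -> {place: tokens produced}
    sequence        : list of transition names, fired left to right
    """
    marking = dict(initial_marking)
    for t in sequence:
        if t not in pre or t not in post:
            return False  # unknown transition
        # A transition is enabled only if every input place holds enough tokens.
        if any(marking.get(p, 0) < need for p, need in pre[t].items()):
            return False
        for p, need in pre[t].items():
            marking[p] = marking.get(p, 0) - need
        for p, produced in post[t].items():
            marking[p] = marking.get(p, 0) + produced
    return True
```

A checker like this could label generated sequences as legal or non-legal when preparing training data, although the abstract does not say how the authors produced their labels.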
Pub Date: 2023-09-20; DOI: 10.1109/TBDATA.2023.3313030
Han Wu;Guanqi Zhu;Qi Liu;Hengshu Zhu;Hao Wang;Hongke Zhao;Chuanren Liu;Enhong Chen;Hui Xiong
Patent litigation is an expensive and time-consuming legal process. To reduce costs, companies can proactively manage patents using predictive analysis to identify potential plaintiffs, defendants, and patents that may lead to litigation. However, there has been limited progress in predicting patent litigation due to the scarcity of lawsuits, the complexities of intentions, and the diversity of litigation characteristics. To this end, in this paper, we summarize the major causes of patent litigation into multiple aspects: the complex relations among plaintiffs, defendants and patents as well as the diverse content information from them. Along this line, we propose a Multi-aspect Neural Tensor Factorization (MANTF) framework for patent litigation prediction. First, a Pair-wise Tensor Factorization (PTF) module is designed to capture the complex relations among plaintiffs, defendants and patents inherent in a three-dimensional tensor, which will produce factorized latent vectors for companies and patents with pair-wise ranking estimators. Then, to better represent the patents and companies as an aid for PTF, we design a Patent Embedding Network (PEN) module and a Mask Company Embedding Network (MCEN) module to generate content-aware embedding for them, where PEN represents patents based on their meta, textual and graphical features, and MCEN represents companies by integrating their intrinsic features and competitions. Next, to integrate these three modules together, we leverage a Gaussian prior on the difference between factorized representations and content-aware embedding, and train MANTF in an end-to-end way. In the end, final predictions for patent litigation, i.e., the potentially litigated plaintiffs, defendants and patents, can be made with the well-trained model. 
We conduct extensive experiments on two real-world datasets, whose results show that MANTF not only helps predict potential patent litigation but also remains robust under various data-sparse conditions.
Citation: "A Multi-Aspect Neural Tensor Factorization Framework for Patent Litigation Prediction", IEEE Transactions on Big Data, vol. 10, no. 1, pp. 35-54.
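A pair-wise ranking estimator over a (plaintiff, defendant, patent) tensor can be sketched in a few lines of Python. The abstract does not give the scoring function or loss used in the PTF module, so everything here is an assumption: the sketch scores a triple as the sum of pairwise inner products between latent vectors and uses a logistic ranking loss that prefers an observed triple over an unobserved one. The names `triple_score` and `pairwise_ranking_loss` are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def triple_score(plaintiff, defendant, patent):
    """Score a triple as the sum of pairwise interactions (an assumed form)."""
    return (dot(plaintiff, defendant)
            + dot(plaintiff, patent)
            + dot(defendant, patent))


def pairwise_ranking_loss(pos_score, neg_score):
    """Negative log-sigmoid of the margin between an observed and an unobserved triple.

    Written in a numerically stable form so that large negative margins
    do not overflow math.exp.
    """
    margin = pos_score - neg_score
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the observed triple is scored further above the unobserved one, which is the behavior a ranking-based factorization relies on when lawsuits are scarce and most tensor entries are unobserved.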
Fine-grained urban flow inference (FUFI) problem aims to infer the fine-grained flow maps from coarse-grained ones, benefiting various smart-city applications by reducing electricity, maintenance, and operation costs. Existing models use techniques from image super-resolution and achieve good performance in FUFI. However, they often rely on supervised learning with a large amount of training data, and often lack generalization capability and face overfitting. We present a new solution: S